
Theory

theory

 

Tweedie's formula

One can estimate the mean of a Gaussian distribution: given a random variable $z \sim \mathcal{N}(\mu_z, \Sigma_z)$, we have $\mathbb{E}[\mu_z \mid z] = z + \Sigma_z \nabla_z \log p(z)$.

In diffusion models, $x_t \sim \mathcal{N}(\sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t)I)$, so given an $x_t$, the estimate of the mean $\sqrt{\bar{\alpha}_t}\,x_0$ is $x_t + (1-\bar{\alpha}_t)\nabla_{x_t}\log q(x_t \mid x_0)$. Combined with $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t$, this gives $\nabla_{x_t}\log q(x_t \mid x_0) = -\frac{\epsilon_t}{\sqrt{1-\bar{\alpha}_t}}$.
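A quick numeric sanity check of Tweedie's formula, as a sketch with a 1-D Gaussian prior (all numbers made up): in this toy case the marginal score of the noised variable is available in closed form, so the Tweedie estimate can be compared against the exact Gaussian posterior mean.

```python
import math

# Toy check of Tweedie's formula with a 1-D Gaussian prior x0 ~ N(m0, s0^2).
# Marginal of the noised variable: x_t ~ N(sqrt(ab)*m0, ab*s0^2 + 1 - ab),
# where ab = alpha_bar_t, so its score is known in closed form.
m0, s0 = 2.0, 0.5
ab = 0.3                       # alpha_bar_t
xt = 1.7                       # an observed noisy sample
var_t = ab * s0**2 + 1.0 - ab  # Var(x_t)

score = -(xt - math.sqrt(ab) * m0) / var_t   # d/dx_t log p(x_t)
tweedie = xt + (1.0 - ab) * score            # estimate of E[sqrt(ab)*x0 | x_t]

# Closed-form Gaussian posterior mean for comparison:
# E[x0|x_t] = m0 + Cov(x0, x_t)/Var(x_t) * (x_t - E[x_t])
post_mean_x0 = m0 + math.sqrt(ab) * s0**2 / var_t * (xt - math.sqrt(ab) * m0)
assert abs(tweedie - math.sqrt(ab) * post_mean_x0) < 1e-12
```

The two estimates agree exactly, confirming that Tweedie's formula recovers the posterior mean of the clean signal from the marginal score.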

 

Awesome Architecture

Cascaded

Cascaded Diffusion Models for High Fidelity Image Generation

 

MDM

Matryoshka Diffusion Models

Matryoshka-Diffusion

All resolutions are trained jointly. During training, every resolution of a given sample uses the same noising timestep to avoid information leakage; the noise-schedule shift proposed by SimpleDiffusion is also used.

Similar to ProgressiveGAN, training starts at low resolution; the UNet is then gradually widened and more loss terms are added to train higher resolutions. When training at high resolution, the low-resolution parts of the network are trained jointly as well.

 

ADM

Diffusion Models Beat GANs on Image Synthesis

AdaGN: map the class label to a fixed dimension and add it to the time embedding; an MLP then predicts $y_s, y_b$, which apply an affine transform to the GroupNorm output: $y_s \cdot \mathrm{GroupNorm}(h) + y_b$.
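A minimal numpy sketch of AdaGN as described above (the single matrix `W` and bias `b` are hypothetical stand-ins for the learned MLP head; shapes are made up):

```python
import numpy as np

def ada_group_norm(h, emb, W, b, groups=4, eps=1e-5):
    """AdaGN sketch: predict (y_s, y_b) from the summed class+time embedding,
    then modulate the GroupNorm output as y_s * GroupNorm(h) + y_b."""
    g = h.reshape(groups, -1)          # (groups, C/groups * H * W)
    g = (g - g.mean(1, keepdims=True)) / np.sqrt(g.var(1, keepdims=True) + eps)
    normed = g.reshape(h.shape)
    ys, yb = np.split(emb @ W + b, 2)  # per-channel scale and shift
    return ys[:, None, None] * normed + yb[:, None, None]

rng = np.random.default_rng(0)
C, H, Wd, E = 8, 4, 4, 16
h = rng.normal(size=(C, H, Wd))        # a feature map
emb = rng.normal(size=E)               # time embedding + class embedding
out = ada_group_norm(h, emb, rng.normal(size=(E, 2 * C)), np.zeros(2 * C))
assert out.shape == h.shape
```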

super-resolution model: upsample the low-resolution image $x_0^{low}$ to the size of $x_t^{high}$, concatenate it with $x_t^{high}$ along the channel dimension, and feed the result into the UNet to predict the noise of $x_t^{high}$; the UNet takes 6 input channels and outputs 3. Note that for every $t$, what is concatenated with $x_t^{high}$ is always $x_0^{low}$, never $x_t^{low}$.

 

D3PM

Structured Denoising Diffusion Models in Discrete State-Spaces

  1. Discrete Diffusion Model

 

GGM

Glauber Generative Model: Discrete Diffusion Models via Binary Classification

  1. D3PM's forward process randomly noises every token at each step; GGM noises only a single token per step, so each reverse step only needs to predict one token.

  2. Sample a random index sequence $\{i_t\}_{t=0}^{T-1}$ of length $T$. When noising, $x_{t+1}$ keeps every token except the one at position $i_t$ unchanged, and the token at $i_t$ is randomly re-noised (it may also stay the same). The same index sequence is used for both training and sampling.
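The forward process above can be sketched in a few lines (a toy sketch on integer tokens; the vocabulary and sequence are made up):

```python
import random

def ggm_forward(x0, T, vocab, seed=0):
    """Sketch of GGM's forward process: at step t only position idx[t] may be
    re-noised (resampled from the vocabulary); all other tokens are copied.
    The index sequence is fixed up front and shared by training and sampling."""
    rng = random.Random(seed)
    idx = [rng.randrange(len(x0)) for _ in range(T)]   # {i_t}_{t=0}^{T-1}
    xs = [list(x0)]
    for t in range(T):
        xt = list(xs[-1])
        xt[idx[t]] = rng.choice(vocab)                 # may also stay the same
        xs.append(xt)
    return xs, idx

xs, idx = ggm_forward([1, 2, 3, 4], T=6, vocab=[0, 1, 2, 3, 4])
for t in range(6):  # every step changes at most the single position idx[t]
    diff = [i for i in range(4) if xs[t][i] != xs[t + 1][i]]
    assert diff == [] or diff == [idx[t]]
```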

 

ImageBART

ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

  1. VQGAN + multinomial diffusion

  2. Transformer Encoder: takes $x_{t+1}$; Transformer Decoder: autoregressively (shifted) predicts $x_t$.

ImageBART

 

VQ-Diffusion

Vector Quantized Diffusion Model for Text-to-Image Synthesis

  1. VQVAE + multinomial diffusion

  2. Transformer blocks: take $x_{t+1}$ as input, cross-attend to the text, and predict $x_t$ non-autoregressively.

 

LDM

High-Resolution Image Synthesis with Latent Diffusion Models

AutoEncoder: $H \times W \times 3 \rightarrow \frac{H}{f} \times \frac{W}{f} \times C$, $f = 2^m$; train an autoencoder that moderately reduces the image's dimensionality.

Freeze the autoencoder and train a UNet-based DDPM to model the latent space of the downscaled data. Benefits: less computation; the latent retains the image's spatial structure, so the UNet's inductive bias still applies; and the learned latent space can be reused further. $f = 4$ or $8$ generally works best.

slight regularization (KL or VQ) to avoid high-variance latent spaces.

Cross-attention layers are injected into the UNet, placed after self-attention. Q: the $x_t$ feature map; K, V: the conditions.
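A numpy sketch of that cross-attention layer (projection matrices are hypothetical stand-ins for learned weights; shapes are made up): queries come from the flattened $x_t$ feature map, keys/values from the condition tokens.

```python
import numpy as np

def cross_attention(feat, cond, Wq, Wk, Wv):
    """Cross-attention as in LDM's UNet: each spatial position of the x_t
    feature map attends over the condition tokens (e.g. text embeddings)."""
    Q = feat @ Wq                      # (N_pixels, d)
    K = cond @ Wk                      # (N_cond,   d)
    V = cond @ Wv
    logits = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = np.exp(logits - logits.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)  # softmax over condition tokens
    return attn @ V                    # each pixel aggregates condition info

rng = np.random.default_rng(0)
feat = rng.normal(size=(16, 32))       # 4x4 feature map flattened, dim 32
cond = rng.normal(size=(5, 24))        # e.g. 5 text tokens, dim 24
out = cross_attention(feat, cond,
                      rng.normal(size=(32, 8)),
                      rng.normal(size=(24, 8)),
                      rng.normal(size=(24, 8)))
assert out.shape == (16, 8)
```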

 

Wuerstchen

StableCascade

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

three-stages to reduce computational demands

  1. stage A: train a VQGAN with a 4× downsampling rate, $1024 \rightarrow 256$.

  2. Semantic Compressor: resize the image from 1024 to 768, and train a network that compresses it to $16 \times 24 \times 24$.

  3. stage B: a diffusion model over the pre-quantization embeddings of stage A, conditioned on the semantic-compressor output of the image (Wuerstchen additionally conditions on text), which amounts to self-conditioning.

  4. stage C: a diffusion model over the semantic-compressor output of the image, conditioned on text.

  5. Generation proceeds $C \rightarrow B \rightarrow A$.

 

BinaryLDM

Binary Latent Diffusion

Same idea as LDM, except that the latent is binarized.

Following the VQ idea but replacing nearest-neighbor quantization with Bernoulli sampling, train a binary-latent autoencoder: $x \in \mathbb{R}^{h \times w \times 3} \rightarrow y = \mathrm{Sigmoid}(E(x)) \rightarrow z = \mathrm{Bernoulli}(y) \rightarrow \hat{z} = \mathrm{sg}(z) + y - \mathrm{sg}(y) \rightarrow \hat{x} = D(\hat{z})$, where $\mathrm{sg}$ is the stop-gradient operator.
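A forward-pass-only sketch of the straight-through Bernoulli latent (numpy has no autograd, so only the forward behavior is shown; the encoder output is simulated with random logits):

```python
import numpy as np

def binary_latent(y, rng):
    """Straight-through Bernoulli sketch: the forward pass uses the binary
    sample z, while sg(z) + y - sg(y) numerically equals z but routes
    gradients through the continuous probabilities y during training."""
    z = (rng.random(y.shape) < y).astype(float)  # z = Bernoulli(y)
    z_hat = z  # forward value of sg(z) + y - sg(y); only gradients differ
    return z_hat

rng = np.random.default_rng(0)
y = 1 / (1 + np.exp(-rng.normal(size=(4, 4))))   # y = Sigmoid(E(x))
z_hat = binary_latent(y, rng)
assert set(np.unique(z_hat)) <= {0.0, 1.0}       # latent is strictly binary
```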

Derive a Bernoulli diffusion process, and model the distribution of $z$ with a diffusion model.

 

SimpleDiffusion

Simple Diffusion: End-to-end diffusion for high resolution images

SimpleDiffusion

  1. Existing high-resolution diffusion models come in two flavors: StableDiffusion's latent-space dimensionality reduction, and coarse-to-fine cascaded super-resolution. SimpleDiffusion instead trains directly in pixel space at high resolution, using the following techniques.

  2. Change the noise schedule: at the same timestep, a noised high-resolution image $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ remains more recognizable than its low-resolution counterpart, because a high-resolution image spends many pixels on each detail, and neighboring pixels provide large information redundancy (see On the Importance of Noise Scheduling for Diffusion Models). As a result, the diffusion process for high-resolution images is light early and heavy late, and at generation time the model must build the image's global structure within an early and short time window, which hurts training. The fix is to pick a base resolution of 64x64, set an empirically well-performing SNR function for it, and then derive the corresponding SNR function from it for the target resolution, yielding a shifted noise schedule.

  3. Multiscale training: one difficulty of direct pixel-space training at high resolution is that high-frequency content (object edges, etc.) is hard to model and dominates the training loss, so the paper proposes a multiscale training loss $L^{d \times d}_\theta(x) = \frac{1}{d^2}\mathbb{E}_{\epsilon,t}\left\|D^{d \times d}[\epsilon] - D^{d \times d}[\epsilon_\theta(x_t, t)]\right\|_2^2$, where $D^{d \times d}[\cdot]$ denotes downsampling to $d \times d$ resolution. Since downsampling is a linear operator, $D^{d \times d}[\epsilon_\theta]$ can be viewed as a $d \times d$ diffusion model. The final objective is $\sum_{s \in \{32, 64, 128, \ldots, d\}} \frac{1}{s} L^{s \times s}_\theta(x)$.

  4. To address memory and compute, add network depth (more blocks) at a low-resolution feature map, 16x16 in this paper, and put a downsampling layer at the very front of the model and an upsampling layer at the very end, avoiding computation at the highest resolution.

  5. Apply dropout only on low-resolution feature maps.
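The schedule shift in point 2 can be sketched as follows (a sketch assuming a cosine base schedule at 64x64 and an SNR scaling of $(64/d)^2$ at resolution $d$, i.e. a constant offset of $2\log(64/d)$ in log-SNR):

```python
import math

def logsnr_cosine(t):
    """Cosine-schedule log-SNR at the 64x64 base resolution, t in (0, 1)."""
    return -2.0 * math.log(math.tan(math.pi * t / 2.0))

def logsnr_shifted(t, d, base=64):
    """Shifted schedule: at resolution d the SNR is scaled by (base/d)^2, so
    higher resolutions spend more time at low SNR, where the model lays down
    the global structure of the image."""
    return logsnr_cosine(t) + 2.0 * math.log(base / d)

t = 0.5
assert logsnr_shifted(t, 64) == logsnr_cosine(t)   # no shift at the base size
assert logsnr_shifted(t, 256) < logsnr_cosine(t)   # noisier at 256x256
assert abs(logsnr_shifted(t, 256) - (logsnr_cosine(t) - 2 * math.log(4))) < 1e-12
```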

 

Scaling-Law-1

On the Scalability of Diffusion-based Text-to-Image Generation

  1. For model scaling, we find the location and amount of cross attention distinguishes the performance of existing UNet designs. And increasing the transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel numbers.

  2. On the data scaling side, we show the quality and diversity of the training set matters more than simply dataset size. Increasing caption density and diversity improves text-image alignment performance and the learning efficiency.

 

Scaling-Law-2

Bigger is not Always Better: Scaling Properties of Latent Diffusion Models

  1. When operating under a given inference budget, smaller models frequently outperform their larger equivalents in generating high-quality results.

 

Scaling-Law-3

Computational Tradeoffs in Image Synthesis Diffusion, Masked-Token, and Next-Token Prediction

  1. We recommend diffusion for applications targeting image quality and low latency; and next-token prediction when prompt following or throughput is more important.

 

RDM

Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

RDM-1

RDM-2

  1. Cascaded models perform better than end-to-end models under a fair setting. RDM also follows the cascaded paradigm, but unlike traditional cascaded models it cascades along timesteps, which reduces the number of training and sampling steps.

  2. Both the low and high resolutions use the EDM formulation, i.e. $x_t = x + \sigma\epsilon$.

  3. To relay at time $t$, the $x_t$ of the separately trained low- and high-resolution diffusion models must match in SNR. But the same noise level at a higher resolution results in a higher signal-to-noise ratio in the frequency domain, so RDM proposes Block Noise (a different approach from the shifted noise schedule): sample a noise at $64 \times 64$ and upsample it to $256 \times 256$ with Block Noise; once these two noises are added to the images at their respective resolutions, the SNRs match.

  4. There is still a gap between the $x_t$ generated at low resolution and the true high-resolution $x_t$, so the high-resolution stage is modeled with blurring diffusion: its forward process not only adds noise but also blurs $x$, so the upsampled low-resolution $x_t$ can be treated as the noised version of a blurred high-resolution image.

  5. Note that the blurring diffusion is trained with Block Noise (rather than directly sampling $256 \times 256$ noise), so the upsampled $x_t$ can be fed straight into the blurring diffusion for generation.

 

U-ViT

All are Worth Words: A ViT Backbone for Diffusion Models

U-ViT

  1. ViT in Pixel Space

 

Diffusion-RWKV

Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models

Diffusion-RWKV

  1. RWKV brought improvements for standard RNN architecture, which is computed in parallel during training while inference like RNN. It involves enhancing the linear attention mechanism and designing the receptance weight key value (RWKV) mechanism.

 

DiT

Scalable Diffusion Models with Transformers

DiT

  1. ViT in Latent Space

  2. adaLN: the LayerNorm does not learn its scale and shift; instead, an extra MLP (one per block) predicts a scale and shift from the timestep and condition.

  3. adaLN-Zero: additionally multiply by a predicted scale before the skip connection, and initialize the MLP so that this scale outputs zero; the whole DiT block is then initialized to the identity function, which helps training.
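The identity-at-initialization property of adaLN-Zero can be sketched in numpy (the modulation `mlp` and the sub-layer are hypothetical stand-ins; the real block uses attention/MLP sub-layers):

```python
import numpy as np

def dit_block_adaln_zero(x, c, mlp):
    """adaLN-Zero sketch: a per-block MLP predicts (shift, scale, gate) from
    the conditioning c; LayerNorm carries no learned affine. With the gate
    initialized to zero, the whole block reduces to the identity."""
    shift, scale, gate = mlp(c)
    h = (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + 1e-6)
    h = h * (1 + scale) + shift        # adaLN modulation
    h = h @ np.eye(x.shape[-1])        # placeholder for the attention/MLP sub-layer
    return x + gate * h                # gated residual (skip) connection

x = np.random.default_rng(0).normal(size=(4, 8))
zero_init_mlp = lambda c: (np.zeros(8), np.zeros(8), np.zeros(8))
out = dit_block_adaln_zero(x, None, zero_init_mlp)
assert np.allclose(out, x)             # identity function at initialization
```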

 

U-DiT

U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers

U-DiT

  1. When the encoder processes the input image by downsampling the image as stage-level amounts, the decoder scales up the encoded image from the most compressed stage to input size. At each encoder stage transition, spatial downsampling by the factor of 2 is performed while the feature dimension is doubled as well. Skip connections are provided at each stage transition. The skipped feature is concatenated and fused with the upsampled output from the previous decoder stage, replenishing information loss to decoders brought by feature downsampling.

  2. Similar to ToDo, token downsampling is used to reduce computation, but without losing information: the $N \times N$ feature map is downsampled (via depthwise convolution) into four $\frac{N}{2} \times \frac{N}{2}$ features; each does self-attention independently, and the four results are interleaved back into $N \times N$, e.g. the first elements of the four outputs form a $2 \times 2$ grid as the first element of the final result. Total attention compute drops to $\frac{1}{4}$, yet no information is lost. Unlike U-Net downsampling, we are not reducing or increasing the number of elements in the feature during the downsampling process. The substitution of downsampled self-attention for full-scale self-attention brings a slight improvement in the FID metric despite a significant reduction in FLOPs.
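The interleaved split-and-merge (without the depthwise convolution, which is omitted in this sketch) can be shown to be lossless:

```python
import numpy as np

def split_tokens(x):
    """U-DiT-style token downsampling sketch: an N x N map becomes 4
    interleaved N/2 x N/2 maps; self-attention runs on each quarter, so
    attention cost drops to roughly 1/4 of full-scale."""
    return [x[i::2, j::2] for i in (0, 1) for j in (0, 1)]

def merge_tokens(parts, n):
    """Inverse interleave: the k-th elements of the 4 maps form a 2x2 grid,
    so the original map is recovered exactly."""
    x = np.empty((n, n))
    for p, (i, j) in zip(parts, [(0, 0), (0, 1), (1, 0), (1, 1)]):
        x[i::2, j::2] = p
    return x

x = np.arange(64.0).reshape(8, 8)
parts = split_tokens(x)
assert all(p.shape == (4, 4) for p in parts)
assert np.array_equal(merge_tokens(parts, 8), x)   # lossless round trip
```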

 

Switch-DiT

Switch Diffusion Transformer: Synergizing Denoising Tasks with Sparse Mixture-of-Experts

Switch-DiT

  1. Introduces MoE: each block uses a timestep-based gating network to predict a probability distribution and selects the Top-K experts. This isolates parameters and alleviates conflicts between different timesteps.

 

SiT

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

  1. design space + DiT

 

HDiT

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

  1. A standard Transformer spends the same compute in every block. HDiT applies the UNet idea to Transformers, reducing the token count in the middle layers to lower the computational cost.

  2. Following SimpleDiffusion, it does little computation at high resolution and makes the low-resolution part deeper and wider.

  3. As a result, pixel-space computational complexity grows only linearly with image resolution.

 

DiMR

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models

DiMR

  1. U-ViT

  2. Transformers trade off compute against quality: a small patch size gives long token sequences and high compute but good results; a large patch size gives short sequences and low compute but worse results.

  3. Feature cascade: the model has $R$ branches, all fed the same noised $x_t$. The $r$-th branch reduces the resolution of the input $x_t$ with a $2^{R-r} \times 2^{R-r}$ convolution; its output is upsampled by 2× and concatenated onto the next branch's input. A diffusion loss is computed on every branch's output, the target being the same $\epsilon$ downsampled to the branch's resolution by average pooling.

  4. A U-ViT architecture is used at low resolution and ConvNeXt at high resolution to save computation.

 

Flag-DiT

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

Flag-DiT-1

Flag-DiT-2

  1. Flag-DiT substitutes all LayerNorm with RMSNorm to improve training stability. Moreover, it incorporates key-query normalization (KQ-Norm) before key-query dot product attention computation. The introduction of KQ-Norm aims to prevent loss divergence by eliminating extremely large values within attention logits.

  2. We introduce learnable special tokens including the [nextline] and [nextframe] tokens to transform training samples with different scales and durations into a unified one-dimensional sequence. We add [PAD] tokens to transform 1-D sequences into the same length for better parallelism. Data from different modalities are thus all converted into a single 1-D sequence for unified modeling.

  3. When text is present, self-attention and cross-attention are run in parallel.

 

FiT

FiT: Flexible Vision Transformer for Diffusion Model

FiT

  1. ViT in Latent Space

  2. No cropping; the aspect ratio is preserved, and the image is resized so that $HW \le 256^2$. With 8× VAE downsampling and patch size 2, the token length is at most $\left(\frac{256}{8 \cdot 2}\right)^2 = 256$; shorter sequences are padded to 256, pad tokens are masked out in MHSA, and the loss is computed only on unmasked tokens.

  3. Uses 2D-RoPE positional encodings, whose extrapolation ability enables generating images at arbitrary resolutions and aspect ratios.
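The preprocessing in point 2 can be sketched as a small token-budget calculation (a sketch; the resize rule and rounding are assumptions for illustration):

```python
import math

def fit_resize(h, w, max_pixels=256 * 256, vae=8, patch=2):
    """FiT-style preprocessing sketch: keep the aspect ratio, scale so that
    H*W <= 256^2, then count patch tokens (VAE 8x downsample, patch size 2),
    giving at most (256 / 16)^2 = 256 tokens; shorter sequences are padded."""
    s = min(1.0, math.sqrt(max_pixels / (h * w)))
    nh, nw = int(h * s), int(w * s)
    tokens = (nh // (vae * patch)) * (nw // (vae * patch))
    return nh, nw, tokens

nh, nw, tokens = fit_resize(512, 512)
assert nh * nw <= 256 * 256 and tokens <= 256
nh, nw, tokens = fit_resize(1024, 256)     # a 4:1 aspect ratio is preserved
assert abs(nh / nw - 4.0) < 0.1 and tokens <= 256
```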

 

VisionLLaMA

VisionLLaMA: A Unified LLaMA Interface for Vision Tasks

  1. ViT in Latent Space

  2. a vision transformer architecture similar to LLaMA to reduce the architectural differences between language and vision.

 

DiG

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

DiG

  1. ViT in Latent Space

  2. DiT models have faced challenges with scalability and quadratic complexity efficiency. We leverage the long sequence modeling capability of Gated Linear Attention (GLA) Transformers, expanding its applicability to diffusion models and offering superior efficiency and effectiveness.

 

Mamba

DiS

Scalable Diffusion Models with State Space Backbone

 

DiffuSSM

Diffusion Models Without Attention

 

ZigMa

ZigMa: Zigzag Mamba Diffusion Model

 

DiM

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

 

DiM

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

 

Dimba

Dimba: Transformer-Mamba Diffusion Models

 

Others

Infinite-Diff

Infinite-Diff: Infinite Resolution Diffusion with Subsampled Mollified States

Infinite-Diff

  1. Infinite-Diff is a generative diffusion model defined in an infinite dimensional Hilbert space, which can model infinite resolution data. By training on randomly sampled subsets of coordinates and denoising content only at those locations, we learn a continuous function for arbitrary resolution sampling.

 

INFD

Image Neural Field Diffusion Models

INFD

  1. Neural field is also known as Implicit Neural Representations (INR), which represents signals as coordinate-based neural networks.

  2. Proposes an Image Neural Field Autoencoder, so that there is a latent-space distribution that can be modeled and sampled from.

  3. Similar to Diff-AE and PDAE, a diffusion model is used to model the latent distribution.

 

CAN

Condition-Aware Neural Network for Controlled Image Generation

CAN-1

CAN-2

  1. In conventional conditional models, all conditions share the same static condition-processing network, which limits modeling capacity. One solution is one expert model per condition, but that is prohibitively expensive. Instead, CAN learns a generator network that dynamically produces the parameters of the condition-processing network from the condition: it introduces a condition-aware weight generation module that generates conditional weights for convolution/linear layers based on the input condition.

  2. Making depthwise convolution layers, the patch embedding layer, and the output projection layers condition-aware brings a significant performance boost.

  3. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT, on class conditional image generation on ImageNet and text-to-image generation on COCO.

 

Effectiveness and Efficiency Enhancement

hybrid generative model

D2C

D2C: Diffusion-Decoding Models for Few-Shot Conditional Generation

A diffusion model is used to model the autoencoder's latent distribution.

 

DDGAN

Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

DDGAN-1 DDGAN-2

 

  1. Addresses the intractability of the original (large-step) diffusion model; see chapter 6 of the theory notes.

  2. With large step sizes, $q(x_{t-1}|x_t)$ is no longer Gaussian but complex and multimodal, so adversarial learning can be used to fit $p_\theta(x_{t-1}|x_t)$: $\min_\theta \sum_{t \ge 1} \mathbb{E}_{q(x_t)}\left[D_{adv}\left(q(x_{t-1}|x_t) \,\|\, p_\theta(x_{t-1}|x_t)\right)\right]$, which can be turned into jointly training a discriminator $\min_\phi \sum_{t \ge 1} \mathbb{E}_{q(x_t)}\left\{\mathbb{E}_{q(x_{t-1}|x_t)}\left[-\log D_\phi(x_{t-1}, x_t, t)\right] + \mathbb{E}_{p_\theta(x_{t-1}|x_t)}\left[-\log(1 - D_\phi(x_{t-1}, x_t, t))\right]\right\}$ and a generator $\max_\theta \sum_{t \ge 1} \mathbb{E}_{q(x_t)}\mathbb{E}_{p_\theta(x_{t-1}|x_t)}\left[\log D_\phi(x_{t-1}, x_t, t)\right]$.

  3. $q(x_t)q(x_{t-1}|x_t) = q(x_{t-1}, x_t) = \int \mathrm{d}x_0\, q(x_0, x_{t-1}, x_t) = \int \mathrm{d}x_0\, q(x_0)q(x_{t-1}|x_0)q(x_t|x_{t-1}, x_0) = \int \mathrm{d}x_0\, q(x_0)q(x_{t-1}|x_0)q(x_t|x_{t-1})$, so real pairs are drawn by first noising $x_0$ to $x_{t-1}$, then noising $x_{t-1}$ to $x_t$.

  4. $p_\theta(x_{t-1}|x_t) = \int p_\theta(x_0|x_t)\, q(x_{t-1}|x_t, x_0)\, \mathrm{d}x_0 = \int p(z)\, q(x_{t-1}|x_t, x_0 = G_\theta(z, x_t, t))\, \mathrm{d}z$, so the generator produces an $x_0$ from $x_t$ (and a latent $z$), and an $x_{t-1}$ is then sampled from $q(x_{t-1}|x_t, x_0)$.

  5. The discriminator judges whether $x_{t-1}$ is real. Note that for different $t$, $x_t$ has different levels of perturbation, and hence using a single network to predict $x_{t-1}$ directly at different $t$ may be difficult. However, in our case the generator only needs to predict unperturbed $x_0$ and then add back perturbation using $q(x_{t-1}|x_t, x_0)$.

  6. The DDPM reverse process can likewise be interpreted as $p_\theta(x_{t-1}|x_t) = q\left(x_{t-1}\middle|x_t, x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}\right)$. It remains Gaussian throughout, but performs poorly when the step size is large and the step count small. The difference from DDGAN lies in whether the predicted $x_0$ is deterministic: with small step sizes $p_\theta(x_{t-1}|x_t)$ is Gaussian, so the predicted $x_0$ is deterministic, whereas DDGAN's large steps make the predicted $x_0$ stochastic, so $p_\theta(x_{t-1}|x_t)$ becomes multimodal.

  7. GANs are known to suffer from training instability and mode collapse, and some possible reasons include the difficulty of directly generating samples from a complex distribution in one-shot, and the overfitting issue when the discriminator only looks at clean samples. Our model breaks the generation process into several conditional denoising diffusion steps in which each step is relatively simple to model, due to the strong conditioning on xt. Moreover, the diffusion process smoothens the data distribution, making the discriminator less likely to overfit.
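One DDGAN reverse step (points 4–5) can be sketched as follows; the generator here is a hypothetical stand-in for the adversarially trained network, and the schedule values are made up:

```python
import numpy as np

def ddgan_reverse_step(xt, t, generator, alphas, rng):
    """Sketch of a DDGAN sampling step: the generator maps (z, x_t, t) to a
    predicted x_0, then x_{t-1} is drawn from the tractable Gaussian
    posterior q(x_{t-1}|x_t, x_0). alphas[t-1] holds alpha_t."""
    ab = np.cumprod(alphas)
    a_t, b_t = alphas[t - 1], 1.0 - alphas[t - 1]      # alpha_t, beta_t
    ab_t = ab[t - 1]
    ab_prev = ab[t - 2] if t > 1 else 1.0
    x0 = generator(rng.normal(size=xt.shape), xt, t)   # stochastic x0 via z
    mean = (np.sqrt(ab_prev) * b_t * x0
            + np.sqrt(a_t) * (1.0 - ab_prev) * xt) / (1.0 - ab_t)
    var = (1.0 - ab_prev) / (1.0 - ab_t) * b_t
    return mean + np.sqrt(var) * rng.normal(size=xt.shape)

rng = np.random.default_rng(0)
alphas = np.linspace(0.9, 0.5, 4)                      # T = 4 large steps
xt = rng.normal(size=(2, 2))
fake_gen = lambda z, x, t: np.tanh(x + 0.1 * z)        # stand-in generator
x_prev = ddgan_reverse_step(xt, 4, fake_gen, alphas, rng)
assert x_prev.shape == xt.shape
```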

 

DiffuseVAE

DiffuseVAE: Efficient, Controllable and High-Fidelity Generation from Low-Dimensional Latents

DiffuseVAE

 

FM-Boosting

Boosting Latent Diffusion with Flow Matching

FM-Boosting

  1. LDM inference cost grows quadratically with image resolution.

  2. Use Flow Matching to model the mapping between the upsampled low-resolution latent and the high-resolution latent: generate with a low-resolution LDM, then lift the result to high resolution via Flow Matching.

  3. Ordinary Flow Matching models a map between the data distribution and a Gaussian; here it is modeled between data pairs, hence the name Coupling Flow Matching.

 

PDM

Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model

  1. Generates high-resolution images progressively, following an idea similar to FM-Boosting.

 

refinement of network architectures

LDM

High-Resolution Image Synthesis with Latent Diffusion Models

 

LiteVAE

LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

LiteVAE

  1. We leverage the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality.

 

Wuerstchen

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

 

SimpleDiffusion

Simple Diffusion: End-to-end diffusion for high resolution images

 

BK-SDM

BK-SDM: A Lightweight, Fast, and Cheap Version of Stable Diffusion

  1. compact UNet:fewer blocks in the down and up stages (Base), further removal of the entire mid-stage (Small), further removal of the innermost stages (Tiny).

  2. distillation-based retraining: besides the diffusion loss, a trained large StableDiffusion can be used for output-level distillation (MSE between the two networks' outputs given the same input) and feature-level distillation (MSE between the two networks' features given the same input).
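The two distillation terms can be sketched as follows (a sketch with dummy arrays; the loss weights are hypothetical, and in practice these terms are added to the ordinary diffusion loss):

```python
import numpy as np

def distill_losses(student_out, teacher_out, student_feats, teacher_feats):
    """Sketch of BK-SDM's distillation objectives: output-level KD (MSE
    between predictions for the same input) plus feature-level KD (MSE
    between intermediate features at matched blocks)."""
    l_out = np.mean((student_out - teacher_out) ** 2)
    l_feat = sum(np.mean((s - t) ** 2)
                 for s, t in zip(student_feats, teacher_feats))
    return l_out + 1.0 * l_feat        # hypothetical weighting

rng = np.random.default_rng(0)
t_out = rng.normal(size=(4, 4))
s_out = t_out + 0.1                    # student off by a constant 0.1
feats = [rng.normal(size=(2, 2)) for _ in range(3)]
loss = distill_losses(s_out, t_out, [f + 0.1 for f in feats], feats)
assert abs(loss - (0.01 + 3 * 0.01)) < 1e-9
```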

 

KOALA

KOALA: Fast and Memory-Efficient Latent Diffusion Models via Self-Attention Distillation

KOALA

  1. Same approach as BK-SDM.

  2. Further improves feature-level distillation: comparing the effect of distilling features output by different modules shows that distilling self-attention output features works best, and among those, the self-attention features at the decoder's early blocks work best.

 

HDiT

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

 

SiT

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

 

DiG

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

 

SnapFusion

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

 

MobileDiffusion

MobileDiffusion: Subsecond Text-to-Image Generation on Mobile Devices

lightweight model architecture + DiffusionGAN + distillation

 

Spiking Network

Spiking-Diffusion Vector Quantized Discrete Diffusion Model with Spiking Neural Networks

Spiking Denoising Diffusion Probabilistic Models

Fully Spiking Denoising Diffusion Implicit Models

SDiT: Spiking Diffusion Model with Transformer

 

NAS

AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration

Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search

 

DDSM

Denoising Diffusion Step-aware Models

Different steps have different importance, so there is no need to use a large model at every step.

slimmable network: a neural network that can be executed at arbitrary model sizes.

Search for the optimal sampling strategy, using models of different sizes at different steps to reduce computation.

 

ScaleLong

ScaleLong: Towards More Stable Training of Diffusion Model via Scaling Network Long Skip Connection

A theoretical analysis, from the perspective of feature norms, of how the coefficients on UNet's long skip connections affect training.

 

Q-DM

Q-DM: An Efficient Low-bit Quantized Diffusion Model

 

DiffuSSM

Diffusion Models Without Attention

 

EDM-2

Analyzing and Improving the Training Dynamics of Diffusion Models

  1. We update all of the operations (e.g., convolutions, activations, concatenation, summation) to maintain magnitudes on expectation.

 

ReDistill

ReDistill: Residual Encoded Distillation for Peak Memory Reduction

  1. Reducing peak memory, which is the maximum memory consumed during the execution of a neural network, is critical to deploy neural networks on edge devices with limited memory budget. We propose residual encoded distillation (ReDistill) for peak memory reduction in a teacher-student framework, in which a student network with less memory is derived from the teacher network using aggressive pooling.

 

Quantum

Quantum Denoising Diffusion Models

Quantum Generative Diffusion Model

Towards Efficient Quantum Hybrid Diffusion Models

Quantum computing.

 

pre-trained network compression

quantization

PTQ

Post-training Quantization on Diffusion Models

 

Q-Diffusion

Q-Diffusion: Quantizing Diffusion Models

 

SD-PTQ

Post-training Quantization with Progressive Calibration and Activation Relaxing for Text-to-Image Diffusion Models

 

LDM-PTQ

Efficient Quantization Strategies for Latent Diffusion Models

 

TerDiT

TerDiT: Ternary Diffusion Models with Transformers

 

PTQ4DiT

PTQ4DiT: Post-training Quantization for Diffusion Transformers

 

HQ-DiT

HQ-DiT: Efficient Diffusion Transformer with FP4 Hybrid Quantization

 

StableQ

StableQ: Enhancing Data-Scarce Quantization with Text-to-Image Data

 

BinaryDM

BinaryDM: Towards Accurate Binarization of Diffusion Model

 

APQ-DM

Towards Accurate Post-training Quantization for Diffusion Models

 

TFMQ-DM

TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models

 

EfficientDM

EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models

 

EDA-DM

Enhanced Distribution Alignment for Post-Training Quantization of Diffusion Models

 

Memory-Efficient

Memory-Efficient Personalization using Quantized Diffusion Model

fine-tune quantized diffusion model

 

MixDQ

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization

 

COMQ

COMQ: A Backpropagation-Free Algorithm for Post-Training Quantization

 

QNCD

QNCD: Quantization Noise Correction for Diffusion Models

 

network pruning

Diff-Pruning

Structural Pruning for Diffusion Models

 

Lottery-Ticket-to-DDPM

Successfully Applying Lottery Ticket Hypothesis to Diffusion Model

 

LAPTOP-Diff

LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models

 

LD-Pruner

LD-Pruner: Efficient Pruning of Latent Diffusion Models using Task-Agnostic Insights

 

LayerMerge

LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging

 

Token Reduction

ToMe

Token Merging: Your ViT But Faster

Token Merging for Fast Stable Diffusion

 

AT-EDM

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

AT-EDM

  1. Uses attention maps to identify redundant tokens and merge them.

 

ToDo

ToDo: Token Downsampling for Efficient Generation of High-Resolution Images

  1. Similar to PixArt-Σ's KV compression, but training-free.

  2. Tokens in close spatial proximity exhibit higher similarity, thus providing a basis for merging without the extensive computation of pairwise similarities.

  3. We employ a downsampling function using the Nearest-Neighbor algorithm to the keys and values of the attention mechanism while preserving the original queries.
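The key/value downsampling in point 3 can be sketched in a few lines of numpy (a sketch under assumptions: a square token grid, a fixed stride, and plain softmax attention; the paper's actual downsampling schedule may differ). Keys and values are nearest-neighbor downsampled on the 2D token grid while all queries are preserved.

```python
import numpy as np

def todo_attention(q, k, v, downsample=2):
    """ToDo-style attention sketch: nearest-neighbor downsample the keys
    and values on the 2D token grid, keep every query.
    q, k, v: (N, d) arrays with N = side * side tokens."""
    n, d = k.shape
    side = int(np.sqrt(n))
    # nearest-neighbor downsampling: keep every `downsample`-th token per axis
    grid = np.arange(n).reshape(side, side)
    keep = grid[::downsample, ::downsample].ravel()
    k_ds, v_ds = k[keep], v[keep]
    # standard softmax attention against the reduced key/value set
    attn = q @ k_ds.T / np.sqrt(d)
    attn = np.exp(attn - attn.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v_ds
```

The output keeps the full token count, so the block drops into an existing attention layer without changing surrounding shapes; only the K/V side shrinks by `downsample**2`.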

 

enhancement of training

novel model family

iDDPM

Improved Denoising Diffusion Probabilistic Models

  1. They achieve similar sample quality using either $\sigma_t^2=\beta_t$ or $\sigma_t^2=\tilde\beta_t$, which are the upper and lower bounds on the variance given by $q(x_0)$ being either isotropic Gaussian noise or a delta function ($\tilde\beta_0=0$), respectively. We choose to parameterize the variance as an interpolation between $\beta_t$ and $\tilde\beta_t$ in the log domain: $\Sigma_\theta(x_t,t)=\exp\left(v\log\beta_t+(1-v)\log\tilde\beta_t\right)$. We did not apply any constraints on the model output $v$, theoretically allowing the model to predict variances outside of the interpolated range. However, we did not observe the network doing this in practice, suggesting that the bounds for $\Sigma_\theta(x_t,t)$ are indeed expressive enough.

  2. In addition to the diffusion loss, they also optimize $0.001\,L_{vlb}$, applying a stop-gradient to the $\mu_\theta(x_t,t)$ output for the $L_{vlb}$ term.

  3. Sampling $t$ uniformly causes unnecessary noise in $L_{vlb}$; instead, the sampling probability of each $t$ is set according to the share of its recent $L_t$ history in the sum over all $L_t$ (importance sampling).
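The importance sampling in point 3 can be sketched as follows (a simplification: iDDPM samples $t$ with probability proportional to $\sqrt{\mathbb{E}[L_t^2]}$ over a short per-timestep loss history; the `history` length and the uniform warm-up rule below mirror the paper's shape but are otherwise assumptions).

```python
import numpy as np

class LossAwareSampler:
    """Sketch of iDDPM-style loss-aware timestep sampling: keep a short
    history of recent L_t per timestep and sample t proportionally to
    sqrt(mean(L_t^2)); stay uniform until every t has enough history."""
    def __init__(self, T, history=10):
        self.T, self.history = T, history
        self.losses = [[] for _ in range(T)]

    def weights(self):
        if any(len(h) < self.history for h in self.losses):
            return np.full(self.T, 1.0 / self.T)  # warm-up: uniform
        w = np.array([np.sqrt(np.mean(np.square(h))) for h in self.losses])
        return w / w.sum()

    def sample(self, rng):
        return rng.choice(self.T, p=self.weights())

    def update(self, t, loss):
        self.losses[t].append(loss)
        self.losses[t] = self.losses[t][-self.history:]
```

Each training step draws `t = sampler.sample(rng)`, computes the loss, and calls `sampler.update(t, loss)`; noisy (high-loss) timesteps are then visited more often.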

 

PD

Progressive Distillation for Fast Sampling of Diffusion Models

  1. v-prediction, which is theoretically equivalent to ϵ-prediction, since the two can be converted into each other.

 

FM

Flow Matching for Generative Modeling

  1. Built on Continuous Normalizing Flows (Neural ODEs). Training a CNF first transforms data samples through the model (ODE simulations) and minimizes the KL divergence between the transformed samples and a standard Gaussian; flow matching is simulation-free, because the ODE path is defined in advance.

  2. Diffusion models and score-based models fit $\nabla_{x_t}\log p_t(x_t)$, while flow matching fits $\frac{d}{dt}x_t$. If the same diffusion kernel is used, flow matching is theoretically equivalent to diffusion and score-based models (since $v_\theta$, $\epsilon_\theta$ and $s_\theta$ are interconvertible, analogous to PD's v-prediction). We find this training alternative to be more stable and robust in our experiments than existing score matching approaches.

 

RectFlow

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

  1. Can model a path between any two distributions.

  2. Randomly draw data $X_0$ and $X_1$ from the two distributions (pairing not required) and use the simplest linear interpolation $X_t=tX_1+(1-t)X_0$, so that $\frac{d}{dt}X_t=X_1-X_0$ and the loss is $\int_0^1\mathbb{E}\left\|(X_1-X_0)-v_\theta(X_t,t)\right\|^2 dt$.

  3. After each round of training, sample from the current flow to obtain coupled data, then retrain on these couplings; iterating this rectification yields straight flows without crossing points.
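The interpolation and regression target in point 2 are simple enough to state directly in code (a minimal sketch; `v_theta` is a stand-in velocity model, not part of the paper):

```python
import numpy as np

def rectflow_pair(x0, x1, t):
    """Rectified flow training pair: linear interpolation
    X_t = t*X1 + (1-t)*X0 with regression target d/dt X_t = X1 - X0."""
    xt = t * x1 + (1 - t) * x0
    target = x1 - x0
    return xt, target

def rectflow_loss(v_theta, x0, x1, t):
    """MSE between the model velocity and the straight-line target."""
    xt, target = rectflow_pair(x0, x1, t)
    return np.mean((v_theta(xt, t) - target) ** 2)
```

The reflow loop in point 3 simply replaces the random `(x0, x1)` draws with couplings obtained by integrating the current model.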

 

EDM

Elucidating the Design Space of Diffusion-Based Generative Models

  1. $x_t=x_0+\epsilon$, $\epsilon\sim\mathcal N(0,\sigma_t^2 I)$, $D_\theta(x_t,\sigma_t)\approx x_0$.

  2. preconditioning: As the input $x_t$ is a combination of clean signal and noise, its magnitude varies immensely depending on the noise level $\sigma_t$. For this reason, the common practice is to not represent $D_\theta$ as a neural network directly, but instead train a different network $F_\theta$ from which $D_\theta$ can be derived. VE trains $F_\theta$ to predict $\epsilon$ scaled to unit variance, from which the signal is then reconstructed via $D_\theta(x_t,\sigma_t)=x_t-\sigma_t F_\theta(x_t,\sigma_t)$. This has the drawback that at large $\sigma$, the network needs to fine-tune its output carefully to cancel out the existing noise $\epsilon$ exactly and give the output at the correct scale. Note that any errors made by the network are amplified by a factor of $\sigma_t$. In this situation, it would seem much easier to predict the expected output $D_\theta(x_t,\sigma_t)$ directly. To this end, we propose to precondition the neural network with a $\sigma$-dependent skip connection that allows it to estimate either $x_0$ or $\epsilon$, or something in between: $D_\theta(x_t,\sigma_t)=c_{skip}(\sigma_t)\,x_t+c_{out}(\sigma_t)\,F_\theta(c_{in}(\sigma_t)\,x_t,\sigma_t)\approx x_0$, i.e. $F_\theta(c_{in}(\sigma_t)x_t,\sigma_t)\approx\frac{1}{c_{out}(\sigma_t)}\left(x_0-c_{skip}(\sigma_t)x_t\right)$. We choose $c_{in}$ and $c_{out}$ to make the network inputs and training targets have unit variance, and $c_{skip}$ to amplify the errors in $F_\theta$ as little as possible. Other diffusion models always have $c_{skip}=1$.

  3. augmentation: To prevent potential overfitting that often plagues diffusion models with smaller datasets, we apply various geometric transformations to a training image prior to adding noise. To prevent the augmentations from leaking to the generated images, we provide the augmentation parameters as a conditioning input to $F_\theta$; during inference we set them to zero to guarantee that only non-augmented images are generated. (macro conditioning as augmentation)
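The unit-variance requirement in point 2 pins the coefficients down to closed forms; the EDM paper's published choices can be sketched as follows (with $\sigma_{data}$ the assumed standard deviation of the clean data, 0.5 in the paper):

```python
import numpy as np

def edm_preconditioning(sigma, sigma_data=0.5):
    """EDM preconditioning coefficients: c_in scales the input to unit
    variance, c_out scales the training target to unit variance, and
    c_skip minimizes the amplification of errors in F_theta."""
    c_skip = sigma_data**2 / (sigma**2 + sigma_data**2)
    c_out = sigma * sigma_data / np.sqrt(sigma**2 + sigma_data**2)
    c_in = 1.0 / np.sqrt(sigma**2 + sigma_data**2)
    return c_skip, c_out, c_in

def d_theta(f_theta, x_t, sigma, sigma_data=0.5):
    """Preconditioned denoiser D(x_t, sigma) built from a raw network F."""
    c_skip, c_out, c_in = edm_preconditioning(sigma, sigma_data)
    return c_skip * x_t + c_out * f_theta(c_in * x_t, sigma)
```

Note how `c_skip` interpolates between $x_0$-prediction at large $\sigma$ ($c_{skip}\to 0$) and near-identity at small $\sigma$ ($c_{skip}\to 1$), exactly the "estimate either $x_0$ or $\epsilon$, or something in between" behavior described above.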

 

VDM

Variational Diffusion Models

efficient optimization of the noise schedule jointly with the rest of the model

 

VDM++

Understanding Diffusion Objectives as the ELBO with Simple Data Augmentation

 

DiffEnc

DiffEnc: Variational Diffusion with a Learned Encoder

 

SPD

Image generation with shortest path diffusion

 

optimal noise schedule

Cosine

Improved Denoising Diffusion Probabilistic Models

  1. Unlike the linear schedule, the cosine schedule directly defines $\bar\alpha_t=\cos^2\!\left(\frac{t/T+0.008}{1.008}\cdot\frac{\pi}{2}\right)$ and then derives $\beta_t$ from it.
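The definition above can be sketched directly (following the paper: $\bar\alpha$ is normalized so $\bar\alpha_0=1$, and $\beta_t=1-\bar\alpha_t/\bar\alpha_{t-1}$ is clipped to at most 0.999):

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cosine schedule: alpha_bar(t) = cos^2(((t/T + s)/(1 + s)) * pi/2),
    normalized so that alpha_bar(0) == 1."""
    f = lambda u: math.cos(((u / T + s) / (1 + s)) * math.pi / 2) ** 2
    return f(t) / f(0)

def cosine_beta(t, T, s=0.008):
    """Recover beta_t from the alpha_bar ratio, clipped as in the paper."""
    return min(1 - cosine_alpha_bar(t, T, s) / cosine_alpha_bar(t - 1, T, s), 0.999)
```

The small offset $s=0.008$ keeps $\beta_t$ from being vanishingly small near $t=0$.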

 

FixFlaw

Common Diffusion Noise Schedules and Sample Steps are Flawed

FixFlaw-1

FixFlaw-2

  1. With the noise schedule used by Stable Diffusion, the final noising step is $z_T=0.068265\,z_0+0.997667\,\epsilon$, which is not a standard Gaussian: $z_T$ still contains some information and its mean is not 0 (a black image ($-1$) gives a negative mean, a white image ($1$) a positive one). At generation time a random $z_T$ with zero mean is used, so only medium-brightness images are generated. Likewise, drawing a random $\epsilon$ and using it to noise some image to $z_T$, generation starting from that $z_T$ and generation starting from $\epsilon$ give different results; see Magic-Fixup.

  2. The fix is to enforce zero terminal SNR, i.e. $\bar\alpha_T=0$. The existing noise schedule is corrected with a rescaling method: keep $\bar\alpha_1$ unchanged, set $\bar\alpha_T=0$, rescale $\bar\alpha_t$ for $t=2,\dots,T$ accordingly, and then retrain the model.

  3. With a zero-terminal-SNR noise schedule, the denoising loss at step $T$ becomes meaningless, so the v-prediction parameterization proposed in PD is recommended: $v_t=\sqrt{\bar\alpha_t}\,\epsilon-\sqrt{1-\bar\alpha_t}\,x_0$, so that $v_T=-x_0$ and $v_1=\sqrt{\bar\alpha_1}\,\epsilon-\sqrt{1-\bar\alpha_1}\,x_0$, and the prediction is meaningful at every step.

  4. Combining the two points above, an existing Stable Diffusion can be fine-tuned with the corrected noise schedule and v-prediction, with matching quality.

  5. Rescale Classifier-Free Guidance: with a zero-terminal-SNR noise schedule, the original CFG becomes sensitive and causes over-exposed images, so the CFG output is rescaled.
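The rescaling in point 2 operates on $\sqrt{\bar\alpha_t}$ and can be sketched as below (following the paper's algorithm: shift the sequence so the last entry is 0, then scale so the first entry is unchanged):

```python
import numpy as np

def enforce_zero_terminal_snr(alphas_bar):
    """FixFlaw-style rescaling: shift and scale sqrt(alpha_bar) so that
    sqrt(alpha_bar_T) == 0 while sqrt(alpha_bar_1) stays unchanged,
    giving zero terminal SNR."""
    s = np.sqrt(alphas_bar)
    s0, sT = s[0], s[-1]
    s = s - sT                 # shift: last entry becomes 0
    s = s * s0 / (s0 - sT)     # scale: first entry restored to s0
    return s**2
```

The same transform appears in common schedulers as a fine-tuning option; after applying it, the model must be fine-tuned (with v-prediction, per point 3) to match the new schedule.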

 

SingDiffusion

Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models

SingDiffusion-1 SingDiffusion-2

  1. Existing models cannot generate very bright or very dark images.

  2. Because existing models lack zero terminal SNR, they are effectively trained on $(0,1-\epsilon]$. One can therefore fine-tune a pretrained diffusion model with one extra step, or train a separate diffusion model that only covers this final step. As in FixFlaw, the denoising loss at time $T$ is meaningless, so this step is trained with x-prediction.

  3. At sampling time, the first step starts from a Gaussian and samples an $x_{1-\epsilon}$; everything afterwards proceeds as before.

 

optimal loss weighting

P2-weighting

Perception Prioritized Training of Diffusion Models

 

Debias

Debias the Training of Diffusion Models

Similar to P2-weighting.

$\hat x_0=\frac{1}{\sqrt{\bar\alpha_t}}x_t-\frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\epsilon_\theta(x_t,t)=\frac{1}{\sqrt{\bar\alpha_t}}\left(\sqrt{\bar\alpha_t}\,x_0+\sqrt{1-\bar\alpha_t}\,\epsilon\right)-\frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\epsilon_\theta(x_t,t)=x_0+\frac{\sqrt{1-\bar\alpha_t}}{\sqrt{\bar\alpha_t}}\left(\epsilon-\epsilon_\theta(x_t,t)\right)=x_0+\sqrt{\tfrac{1}{\mathrm{SNR}(t)}}\left(\epsilon-\epsilon_\theta(x_t,t)\right)$

Using $\sqrt{\tfrac{1}{\mathrm{SNR}(t)}}$ as the weight on $\left\|\epsilon_\theta(x_t,t)-\epsilon\right\|_2^2$ brings $\hat x_0$ closer to $x_0$.

 

SpeeD

A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

  1. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training.

  2. We design an asymmetric time step sampling strategy that reduces the frequency of time steps from the convergence area while increasing the sampling probability for time steps from other areas.

 

Multi-Task Learning

Multi-Architecture Multi-Expert Diffusion Models

Addressing Negative Transfer in Diffusion Models

eDiffi Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models

Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture

  1. A diffusion model must recognize and handle noise at every level, which is why it needs a large number of parameters.

  2. Use different specialized networks (architectures) for different timesteps (experts), lowering the learning difficulty and reducing the parameter count.

 

Denoising Task Routing for Diffusion Models

 

Dual-Output

Dynamic Dual-Output Diffusion Models

dual-output

Cas-DM

Bring Metric Functions into Diffusion Models

Cas-DM

 

Curriculum Learning

DBCL

Denoising Task Difficulty-based Curriculum for Training Diffusion Models

  1. The learning difficulty of a diffusion model differs across timesteps. Split the timesteps evenly into 20 intervals and train a separate model on each (20 in total), then examine their convergence speed. Both in loss and in generation quality (mixed sampling: use a normally trained diffusion model, switching to the separately trained model only within its designated interval), the larger the timestep, the faster the convergence.

  2. Curriculum Learning: a method of training models in a structured order, starting with easier tasks or examples and gradually increasing difficulty. So after partitioning the timesteps, training starts from the last (easiest) interval and proceeds toward earlier ones; each stage keeps training the previously covered intervals to avoid forgetting.

  3. Faster convergence and better generation quality.

 

Transfer Learning

Diff-Tuning

Diffusion Tuning: Transferring Diffusion Models via Chain of Forgetting

  1. Chain of Forgetting: for small $t$, a diffusion model can do zero-shot denoising, independent of the dataset; for large $t$, its generalization depends heavily on the dataset.

  2. During transfer, some data from the source dataset also participates in training. For source-dataset data, the diffusion loss coefficient decreases monotonically with $t$; for transfer-dataset data, it increases monotonically with $t$.

 

Frequency Domain

SFUNet

Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models

 

Masked Diffusion

MaskDM

Masked Diffusion Models are Fast Learners

Uses the U-ViT architecture (pixel space) with mask ratios up to 90%; converges 4× faster than DDPM with better generation quality.

MaskDM

 

MDT

Masked Diffusion Transformer is a Strong Image Synthesizer

Uses the DiT architecture (latent space). To resolve the distribution shift between masked training and unmasked inference, a side-interpolater fills in the masked tokens during training; converges 3× faster than DiT with better generation quality.

 

MaskDiT

Fast Training of Diffusion Models with Masked Transformers

MaskDiT

  1. Uses the DiT architecture (latent space); the DiT encoder can be scaled up, while the DiT decoder is fixed at 8 DiT blocks.

  2. Predicting the score of invisible tokens from visible tokens alone is too hard, so the diffusion loss is split: visible tokens use the diffusion loss, while invisible tokens use an MSE loss against the corresponding noisy patches (note: the model directly predicts the noised invisible patches, not their noise or the clean image), similar to MaskDM + MAE.

  3. With mask ratios up to 50%, it converges 3× faster than DiT to the same generation quality.

  4. The MAE objective is essential: without it, generation quality drops considerably, but if the MAE loss coefficient is too large it also hurts generation, so the coefficient must be chosen carefully. Without the MAE reconstruction task, the training easily overfits the local subset of unmasked tokens as it lacks a global understanding of the full image, making the gradient update less informative. Understanding aids generation.

 

SD-DiT

SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

SD-DiT

  1. The DiT encoder can be scaled up; the DiT decoder is fixed at 8 DiT blocks.

  2. The DiT decoder's inputs are the invisible patches themselves rather than learnable mask tokens, and the diffusion loss is computed on all patches, instead of predicting the invisible patches only from visible tokens as in MaskDiT.

  3. This removes the MAE objective, and without understanding to aid generation, a self-distillation module is introduced: the encoder's last-layer output at each token passes through an MLP + softmax to predict a K-dimensional distribution, trained with a cross-entropy loss against the teacher encoder's prediction as the label, computed only at the unmasked tokens and the class token.

 

Patch

PatchDiffusion

Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

  1. Randomly crop $x$ to obtain $x_{i,j,s}$, where $i,j$ are the top-left coordinates and $s$ the patch size; noise $x_{i,j,s}$ and train an EDM on it, with $i,j,s$ as additional conditioning inputs.

  2. EDM only sees local patches and may have not captured the global cross-region dependency between local patches; in other words, the learned scores from nearby patches should form a coherent score map to induce coherent image sampling. To resolve this issue, we propose two strategies: 1) random patch sizes and 2) involving a small ratio of full-size images.

  3. At sampling time, patches are sampled separately and stitched together.

  4. Through Patch Diffusion, we could achieve 2× faster training, while maintaining comparable or better generation quality.

 

Patch-DM

Patched Denoising Diffusion Models For High-Resolution Image Synthesis

Patch-DM

  1. Rather than using entire complete images for training, our model only takes patches for training and inference and uses feature collage to systematically combine partial features of neighboring patches.

  2. At training and inference, the first approach splits $x_t$ into patches and feeds each patch to the model independently; the second feeds neighboring patches to the model to predict their shared region; the third refines the second approach down to the UNet's feature level.

 

novel diffusion formula

Some concrete tasks are themselves a kind of process, so different Markov transition chains can be designed for training.

ShiftDDPMs

ShiftDDPMs: Exploring Conditional Diffusion Models by Shifting Diffusion Trajectories

 

ContextDiff

Cross-Modal Contextualized Diffusion Models for Text-Guided Visual Generation and Editing

 

CARD

CARD: Classification and Regression Diffusion Models

Formulation similar to PriorGrad; the diffusion model outputs the regression value or the classification probabilities.

 

ExposureDiffusion

ExposureDiffusion: Learning to Expose for Low-light Image Enhancement

Models the camera's image exposure process as a diffusion process.

 

RDDM

Residual Denoising Diffusion Models

Similar to ResShift.

 

Beta-Diffusion

Beta Diffusion

Uses the Beta distribution and optimizes KL-divergence upper bounds.

 

others

FDM

Fast Diffusion Model

Draws a connection to SGD and introduces momentum to speed up training and sampling.

 

DDDM

Directly Denoising Diffusion Model

DDDM

  1. DDDMs train the diffusion model conditioned on an estimated target that was generated from previous training iterations of its own.

  2. Define $f_\theta(x_0,x_t,t)=x_t+\int_t^0 -\tfrac12\beta(s)\left[x_s-\nabla_{x_s}\log q(x_s)\right]ds$ as the solution of the VP PF ODE from initial time $t$ to final time $0$; a neural network represents $F_\theta(x_0,x_t,t)=\int_t^0 \tfrac12\beta(s)\left[x_s-\nabla_{x_s}\log q(x_s)\right]ds$, so $f_\theta(x_0,x_t,t)=x_t-F_\theta(x_0,x_t,t)$. Although the PF ODE is used, no pretrained score model is needed.

 

DeeDiff

DeeDiff: Dynamic Uncertainty-Aware Early Exiting for Accelerating Diffusion Model Generation

early exiting策略:The basic assumption of early exiting is that the input samples of the test set are divided into easy and hard samples. The computation for easy samples terminates once some conditions are satisfied.

Uses U-ViT; during training, an uncertainty estimation module (UEM) is additionally trained on each layer's output to estimate the uncertainty of taking that layer as the final output. The UEM is an MLP predicting a scalar, whose target is the MSE loss between the current layer's output and the last layer's output.

At inference, for each sampling step, as soon as some layer's output uncertainty falls below a given threshold, that layer's output is taken as the final output, achieving a speed-up.

 

DMP

Diffusion Model Patching via Mixture-of-Prompts

DMP

  1. For each block of a pretrained DiT, train an extra set of parameters $p_i$ with the same dimension as the input $x_i$, added to $x_i$ like a positional embedding.

  2. The same prompts are used for each block throughout the training, thus they will learn knowledge that is agnostic to denoising stages. To patch the model with stage-specific knowledge, we introduce dynamic gating. This mechanism blends prompts in varying proportions based on the noise level of an input image. A gating network is learned: $x_i=\mathrm{block}_i\!\left(\sigma(G([t;i]))\cdot p_{i-1}+x_{i-1}\right)$.

 

Compensation

Compensation Sampling for Improved Convergence in Diffusion Models

An extra UNet is trained to predict the compensation term.

 

CEP

Slight Corruption in Pre-training Data Makes Better Diffusion Models

  1. Similar to CADS in operating on the condition, but CADS only intervenes at sampling time, whereas CEP intervenes during training.

  2. Preliminary experiments: To introduce synthetic corruption into the conditions, we randomly flip the class label into a random class for IN-1K, and randomly swap the text of two sampled image-text pairs for CC3M. As a result, class- and text-conditional models pre-trained with slight corruption achieve significantly lower FID and higher IS and CLIP score. More corruption in pre-training can potentially lead to quality and diversity degradation. As the degradation level increases, almost all metrics first become better and then degrade. However, the degraded measure with more corruption sometimes is still better than the clean ones.

  3. More generally, we propose to directly add perturbation to the conditional embeddings of DMs, which is termed conditional embedding perturbation: a Gaussian noise drawn from $\mathcal N(0,\gamma_d I)$ is added to the condition embedding.

 

SADM

Structure-Guided Adversarial Training of Diffusion Models

SADM

Besides the diffusion loss, pairwise manifold distances are computed within a batch of $x_0$ (a pretrained encoder network maps each image to a vector, and Euclidean distances are taken between them), and likewise for the predicted $\hat x_0$; training minimizes the discrepancy between the two sets of distances, so that images predicted by the diffusion model keep the same manifold structure as the original dataset.

Using a fixed pretrained encoder would lead to a shortcut, so adversarial training is introduced: the encoder network is trained to maximize the discrepancy between the two sets of distances (equivalent to distinguishing fake from real manifold structure).

 

ConPreDiff

Improving Diffusion-Based Image Synthesis with Context Prediction

ConPreDiff

  1. Besides the conventional diffusion loss (self-denoising), which predicts every $x_{t-1}^i$ from $x_t$, an extra network predicts, from each predicted $x_{t-1}^i$, its surrounding neighborhoods $x_{t-1}^j$; the ConPreDiff loss is an upper bound of the negative log likelihood. The gradient of this loss also flows into the self-denoising network (UNet or Transformer).

  2. Sampling uses only the self-denoising network, identical to a conventional diffusion model.

 

Time-Dependent

The Disappearance of Timestep Embedding in Modern Time-Dependent Neural Networks

  1. Explores better ways of injecting the timestep embedding.

 

enhancement of sampling

optimal sampling schedule

DP

Learning to Efficiently Sample from Diffusion Probabilistic Models

 

RL

Learning to Schedule in Diffusion Probabilistic Models

 

AYS

Align Your Steps: Optimizing Sampling Schedules in Diffusion Models

 

LD3

Learning to Discretize Denoising Diffusion ODEs

  1. $\xi$ is a learnable, fixed-length, monotonically decreasing sampling schedule from $T$ to $0$; $\Psi$ is the ODE solver, $\Psi(x_T)$ the teacher solution, and $\Psi_\xi(x_T)$ the student solution sampled with schedule $\xi$: $\min_\xi L_{hard}=\min_\xi\mathbb E_{x_T\sim\mathcal N(0,\sigma_T^2 I)}\left[\mathrm{LPIPS}\left(\Psi_\xi(x_T),\Psi(x_T)\right)\right]$.

  2. Directly optimizing $L_{hard}$ could lead to severe underfitting: to minimize the objective, we would need $\Psi_\xi(x_T)$ to match $\Psi(x_T)$ for any $x_T$, which is hard as we are only allowed to optimize $\xi$, which typically contains no more than 20 parameters for student ODE solvers with low NFE. We only require the existence of an input $x_T'$ that is close to $x_T$, such that the student's output $\Psi_\xi(x_T')$ matches $\Psi(x_T)$. Formally, we define $B(x,r\sigma_T)=\{x'\mid\|x'-x\|_2\le r\sigma_T\}$ as the $L_2$ ball of radius $r\sigma_T$ around $x$: $\min_\xi L_{soft}=\min_\xi\mathbb E_{x_T\sim\mathcal N(0,\sigma_T^2 I),\,x_T'\in B(x_T,r\sigma_T)}\left[\mathrm{LPIPS}\left(\Psi_\xi(x_T'),\Psi(x_T)\right)\right]$.

 

Distillation-based

Broadly five categories: Direct Distillation, Progressive Distillation, Adversarial Distillation, Score Distillation (DI), and Consistency Distillation.

 

DenoisingStudent

Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

  1. Direct distillation: $L=\frac12\mathbb E_{x_T}\left[\left\|F_{student}(x_T)-F_{teacher}(x_T)\right\|_2^2\right]$, where $F_{student}(x_T)$ is the sample generated by the student in one step and $F_{teacher}(x_T)$ is the sample generated by the teacher with multi-step DDIM.

  2. In essence, the teacher constructs an $(x_T,x_0)$-pair dataset for training a one-step student.
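The loss in point 1 can be sketched directly (a sketch: `student` and `teacher_sampler` are stand-in callables for the one-step student and the multi-step DDIM teacher, not a specific implementation):

```python
import numpy as np

def direct_distillation_step(student, teacher_sampler, x_T):
    """Direct distillation: regress the student's one-step sample onto
    the teacher's multi-step DDIM sample for the same x_T."""
    target = teacher_sampler(x_T)   # multi-step DDIM sample (x_T -> x_0)
    pred = student(x_T)             # one-step student sample
    return 0.5 * np.mean((pred - target) ** 2)
```

Equivalently (point 2), one may precompute the teacher's $(x_T, x_0)$ pairs offline and treat this as ordinary supervised regression.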

 

Diffusion2GAN

  1. Based on SDXL.

  2. We can significantly improve the quality of direct distillation by (1) scaling up the size of the ODE pair dataset and (2) using a perceptual loss, not MSE loss.

  3. A VGG network is retrained on SDXL's latents, and the LPIPS loss between the student-generated latent and the teacher-generated latent is optimized.

  4. Besides the LPIPS loss, adversarial training is also used, with a GigaGAN-style multi-scale discriminator.

 

PD

Progressive Distillation for Fast Sampling of Diffusion Models

PD-1

PD-2

  1. The student is trained so that one sampling step reproduces the effect of multiple teacher sampling steps.

  2. v-prediction: the ϵ-prediction parameterization no longer works for distillation, because each DDIM step first computes $\hat x_\theta=\frac{1}{\sqrt{\bar\alpha_t}}\left(x_t-\sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t,t)\right)$; in the first few steps $\bar\alpha_t\to 0$, so tiny changes in $\epsilon_\theta$ are amplified, and in the extreme case of one-step distillation the formula becomes meaningless. A new parameterization is needed that keeps $\hat x_\theta$ stable at any SNR. Since $(\sqrt{\bar\alpha_t})^2+(\sqrt{1-\bar\alpha_t})^2=1$, set $\sqrt{\bar\alpha_t}=\cos(\phi)$ and $\sqrt{1-\bar\alpha_t}=\sin(\phi)$, so $x_\phi=\cos(\phi)x_0+\sin(\phi)\epsilon$, and define $v_\phi=\frac{dx_\phi}{d\phi}=\cos(\phi)\epsilon-\sin(\phi)x_0$. Solving gives $x_0=\cos(\phi)x_\phi-\sin(\phi)v_\phi$ and, by the same algebra, $\epsilon=\sin(\phi)x_\phi+\cos(\phi)v_\phi$. If a network $v_\Omega(x_{\phi_t},\phi_t)$ predicts $v_{\phi_t}$, the two equivalent parameterizations are $\hat x_\Omega=\cos(\phi_t)x_{\phi_t}-\sin(\phi_t)v_\Omega(x_{\phi_t},\phi_t)$ and $\hat\epsilon_\Omega=\sin(\phi_t)x_{\phi_t}+\cos(\phi_t)v_\Omega(x_{\phi_t},\phi_t)$, and this $\hat x_\Omega$ is stable. Rewriting the DDIM update, $x_{\phi_s}=\cos(\phi_s)\hat x_\Omega+\sin(\phi_s)\hat\epsilon_\Omega=\cos(\phi_s)\left[\cos(\phi_t)x_{\phi_t}-\sin(\phi_t)v_\Omega\right]+\sin(\phi_s)\left[\sin(\phi_t)x_{\phi_t}+\cos(\phi_t)v_\Omega\right]$, which simplifies to $x_{\phi_s}=\cos(\phi_s-\phi_t)x_{\phi_t}+\sin(\phi_s-\phi_t)v_\Omega$, i.e. $x_{\phi_t-\Delta}=\cos(\Delta)x_{\phi_t}-\sin(\Delta)v_\Omega$: as the triangle relation in the figure shows, DDIM sampling moves from $\epsilon$ along the tangent direction step by step. Retraining a diffusion model from scratch with this parameterization also works well. Distillation then fits the $\hat x_\Omega$ computed from the student's one-step prediction of $v_\Omega$ to the $\hat x_\theta$ obtained from the teacher's multiple steps (whatever the parameterization, everything reduces to fitting the predicted $\hat x_0$).
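The conversions derived above can be checked numerically (a minimal sketch with $\cos\phi=\sqrt{\bar\alpha_t}$, $\sin\phi=\sqrt{1-\bar\alpha_t}$):

```python
import numpy as np

def v_from_x0_eps(x0, eps, alpha_bar):
    """v parameterization: v = cos(phi)*eps - sin(phi)*x0."""
    c, s = np.sqrt(alpha_bar), np.sqrt(1 - alpha_bar)
    return c * eps - s * x0

def x0_from_v(x_t, v, alpha_bar):
    """Recover x0 = cos(phi)*x_t - sin(phi)*v."""
    c, s = np.sqrt(alpha_bar), np.sqrt(1 - alpha_bar)
    return c * x_t - s * v

def eps_from_v(x_t, v, alpha_bar):
    """Recover eps = sin(phi)*x_t + cos(phi)*v."""
    c, s = np.sqrt(alpha_bar), np.sqrt(1 - alpha_bar)
    return s * x_t + c * v
```

Note $\hat x_0$ recovered from $v$ stays well-scaled even as $\bar\alpha_t\to 0$, which is exactly why v-prediction survives distillation where ϵ-prediction does not.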

 

CFG-PD

On Distillation of Guided Diffusion Models

CFG-PD

  1. Distills a CFG teacher into a student.

  2. Stage 1: train a student with the same number of steps as the teacher, with the guidance strength as an extra condition, sampled uniformly at random during training.

  3. Stage 2: as in PD, iteratively train students with fewer and fewer steps.

  4. At sampling time, N calls of the student can achieve something like 2N steps of stochastic sampling.

 

SFDDM

SFDDM: Single-fold Distillation for Diffusion models

SFDDM

  1. PD halves the step count at each round until the target step count is reached, which is multi-fold; SFDDM distills in a single round, which is single-fold.

  2. With $T$ teacher timesteps and $T'$ student timesteps, $T/T'=c$, define the student's forward process $q(x'_t|x'_{t-1})=\mathcal N\!\left(\sqrt{\tfrac{\bar\alpha_{ct}}{\bar\alpha_{ct-c}}}\,x'_{t-1},\left(1-\tfrac{\bar\alpha_{ct}}{\bar\alpha_{ct-c}}\right)I\right)$, which ensures $q(x'_t|x_0)=q(x_{ct}|x_0)$; derive $q(x'_{t-1}|x'_t,x_0)$ and fit it with the student.

  3. In essence, the student is just a DDPM with very few steps, only supervised by the teacher DDPM; would direct training not work better?

 

SDXL-Lightning

SDXL-Lightning: Progressive Adversarial Diffusion Distillation

  1. PD trained with an adversarial loss instead of MSE.

  2. The discriminator is $D(x_t,x_{t-ns},t,t-ns,c)$, built on a UNet encoder structure: it encodes $x_t,t,c$ and $x_{t-ns},t-ns,c$ separately and fuses the two outputs to predict a score. The condition on $x_t$ is important for preserving the ODE flow. This is because the teacher's generation of $x_{t-ns}$ is deterministic from $x_t$. By providing the discriminator both $x_{t-ns}$ and $x_t$, the discriminator learns the underlying ODE flow and the student must also follow the same flow to fool the discriminator.

 

TRACT

TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation

 

SpeedUpNet

SpeedUpNet: A Plug-and-Play Hyper-Network for Accelerating Text-to-Image Diffusion Models

  1. Adds an extra trainable cross-attention to Stable Diffusion that interacts with the negative prompt.

  2. Two losses, one for single-step and one for multi-step generation; are they in conflict?

 

PDAE-PD

Reducing Spatial Fitting Error in Distillation of Denoising Diffusion Models

 

Imagine-Flash

Imagine Flash: Accelerating Emu Diffusion Models with Backward Distillation

  1. For a given $x_t$, the student's one-step prediction of $\hat x_0$ is trained to fit the teacher's multi-step prediction of $\hat x_0$.

  2. Forward distillation obtains $x_t$ by noising real data via $q(x_t|x_0)$; backward distillation draws a random $x_T$ and obtains $x_t$ by sampling with the student. Since the training goal is fast sampling, and no ground-truth signal exists at sampling time, forward distillation suffers from exposure bias: for forward distillation, the model learns to denoise taking into account information from the ground-truth signal; backward distillation eliminates information leakage at all time steps, preventing the model from relying on a ground-truth signal.

 

DDGAN

Tackling the Generative Learning Trilemma with Denoising Diffusion GANs

 

SIDDM

Semi-Implicit Denoising Diffusion Models

Improves upon DDGAN.

 

UFOGen

UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs

Improves upon SIDDM.

 

YOSO

You Only Sample Once: Taming One-Step Text-To-Image Synthesis by Self-Cooperative Diffusion GANs

YOSO

  1. Define a sequence distribution of clean data $p_\theta^t(x_0)=\int q(x_t)\,p_\theta(x_0|x_t)\,dx_t$ with $p_\theta^0(x_0)=q(x_0)$ and $p_\theta(x_0|x_t)=\delta\!\left(x_0-G_\theta(x_t,t)\right)$; optimize $\mathbb E_t\!\left[D_{adv}\!\left(q(x)\,\|\,p_\theta^t(x)\right)+\lambda\,\mathrm{KL}\!\left(q(x)\,\|\,p_\theta^t(x)\right)\right]$ to learn $G_\theta(x_t,t)$ to predict clean data directly; the first term aligns at the distribution level, the second at the point level.

  2. However, adversarial training directly on clean data cannot avoid the usual difficulties of GAN training. To circumvent this, DDGANs train adversarially on corrupted data, but such an approach fails to directly match $p_\theta(x_0)$, curtailing the efficacy of one-step generation; a dilemma.

  3. Inspired by self-cooperative learning, YOSO still trains adversarially on clean data, but uses $p_\theta^{t-1}(x)$ as the ground truth for learning $p_\theta^t(x)$, i.e. $\mathbb E_t\!\left[D_{adv}\!\left(p_\theta^{t-1}(\mathrm{sg}(x))\,\|\,p_\theta^t(x)\right)+\lambda\left\|G_\theta(x_t,t)-x\right\|_2^2\right]$; samples from both $p_\theta^{t-1}(\mathrm{sg}(x))$ and $p_\theta^t(x)$ are generated with $G_\theta$. The idea resembles CM, except that CM is a point-to-point match with $x_{t-1}$ obtained from $x_t$ by ODE sampling, whereas YOSO is a distribution match with $x_{t-1}$ and $x_t$ sampled independently. An additional CM-style MSE loss is also used.

  4. Before this training, the pretrained diffusion model is fine-tuned: a first stage converts it to v-prediction, a second stage changes the noise schedule to achieve zero terminal SNR; the resulting model is then fully fine-tuned or LoRA fine-tuned as $G_\theta(x_t,t)$.

 

HiPA

HiPA: Enabling One-Step Text-to-Image Diffusion Models via High-Frequency-Promoting Adaptation

HiPA

  1. Generate images with Stable Diffusion at different step counts, extract high- and low-frequency components with the Fourier transform, recombine them, and invert back to images: the main reason one-step generation is poor is that its high-frequency information is not good enough.

  2. LoRA fine-tune Stable Diffusion so that the high-frequency part of the image generated by LoRA+SD in one step stays as close as possible to the high-frequency part of the image generated by SD in many steps.

 

ADD

Adversarial Diffusion Distillation

ADD-1

ADD-2 ADD-3

  1. The student network is initialized from the teacher, with the student's step count set to 4 steps $\{\tau_1,\tau_2,\tau_3,\tau_4\}$, where $\tau_4=1000$. Training randomly draws data from the dataset, noises it to one of the 4 steps, feeds it to the student, generates $\hat x_\theta$ in one step, and trains with two losses.

  2. GAN loss: We use a frozen pretrained feature network and a set of trainable lightweight discriminator heads. The trainable discriminator heads are applied on features at different layers of the feature network.

  3. distillation loss: Notably, the teacher is not directly applied on generations of the ADD-student but instead on diffused inputs, as non-diffused inputs would be out-of-distribution for the teacher model. That is, first sample $t$ and a noise, noise the student's prediction $\hat x_\theta$ to $\hat x_{\theta,t}=\sqrt{\bar\alpha_t}\,\hat x_\theta+\sqrt{1-\bar\alpha_t}\,\epsilon$, then feed it to the teacher for a one-step estimate $\frac{\hat x_{\theta,t}-\sqrt{1-\bar\alpha_t}\,\epsilon_\psi(\hat x_{\theta,t},t)}{\sqrt{\bar\alpha_t}}$, and compute the MSE against the student's prediction $\hat x_\theta$. In fact, $\frac{\hat x_{\theta,t}-\sqrt{1-\bar\alpha_t}\,\epsilon_\psi(\hat x_{\theta,t},t)}{\sqrt{\bar\alpha_t}}-\hat x_\theta=\frac{\sqrt{\bar\alpha_t}\hat x_\theta+\sqrt{1-\bar\alpha_t}\,\epsilon-\sqrt{1-\bar\alpha_t}\,\epsilon_\psi(\hat x_{\theta,t},t)}{\sqrt{\bar\alpha_t}}-\hat x_\theta=\frac{\sqrt{1-\bar\alpha_t}\left(\epsilon-\epsilon_\psi(\hat x_{\theta,t},t)\right)}{\sqrt{\bar\alpha_t}}$, which is equivalent to SDS: the student network plays the role of $g_\theta$ in SDS, pushing its generated samples toward the teacher's.

  4. Training uses one-step generation only; sampling uses 4-step DDIM.

 

LADD

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

LADD

  1. In the VAE latent space, the teacher generates a sample $z_0$, which is noised to $z_t$ and fed to the student, generating $\hat z_\theta$ in one step; $\hat z_\theta$ is noised with the same noise to $\hat z_{\theta,t}$, and both $z_t$ and $\hat z_{\theta,t}$ are fed through the teacher, whose features are used for discrimination.

 

SwiftBrush

SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation

  1. Substitute the NeRF rendering with a text-to-image generator that can directly synthesize a text-guided image in one step, effectively converting the text-to-3D generation training into one-step diffusion model distillation.

 

DI

Diff-Instruct: A Universal Approach for Transferring Knowledge From Pre-trained Diffusion Models

DI-1

DI-2

DI-3

DI-4

  1. We have a pre-trained diffusion model with the multi-level score net denoted as $s_{q^{(t)}}=\nabla_{x_t}\log q^{(t)}(x_t)$.

  2. We aim to train an implicit model $g_\theta$ without any training data, such that the distribution of the generated samples, denoted as $p_g$, matches that of the pre-trained diffusion model.

  3. In order to receive supervision from the multi-level score functions $s_{q^{(t)}}$, introducing the same diffusion process to the generated samples seems inevitable. Consider diffusing $p_g$ along the same forward process as the instructor diffusion model and let $p^{(t)}$ be the corresponding densities at time $t$, with $s_{p^{(t)}}=\nabla_{x_t}\log p^{(t)}(x_t)$.

  4. The IKL is tailored to incorporate knowledge of pre-trained diffusion models in multiple time levels. It generalizes the concept of KL divergence to involve all time levels of the diffusion process.

  5. Along the same diffusion process, starting from the two distributions, their IKL is optimized. Taking the gradient of the IKL with respect to $\theta$ and applying the chain rule through $x_t$ gives $\frac{\partial}{\partial\theta}\log\frac{p^{(t)}(x_t)}{q^{(t)}(x_t)}=\nabla_{x_t}\log\frac{p^{(t)}(x_t)}{q^{(t)}(x_t)}\cdot\frac{\partial x_t}{\partial\theta}=\left[\nabla_{x_t}\log p^{(t)}(x_t)-\nabla_{x_t}\log q^{(t)}(x_t)\right]\frac{\partial x_t}{\partial\theta}$. Therefore $s_\phi(x_t,t)$ is first trained to estimate $s_{p^{(t)}}=\nabla_{x_t}\log p^{(t)}(x_t)$, and then $\theta$ is trained to optimize the IKL.

  6. SDS algorithm is a special case of Diff-Instruct when the generator's output is a Dirac's Delta distribution with learnable parameters. If $g_\theta$'s output is deterministic (the same input always yields the same output), the IKL reduces to a loss of the same form as SDS (without omission or approximation), and $s_{p^{(t)}}$ no longer needs to be trained. This shows that SDS is a special case of Diff-Instruct.

  7. ADD can be viewed as a combination of Diff-Instruct and adversarial training.

 

SDXS

SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions

SDXS

  1. Uses BK-SDM to slim down the student model.

  2. Trains the student model with Diff-Instruct: on $[\alpha T,T]$ the Diff-Instruct recipe is used ($L_{DM}$ trains $s_\phi(x_t,t)$, $L_{IKL}$ trains the student model), while on $[0,\alpha T]$ training uses $L_{FM}$.

  3. $L_{FM}=\sum_l w_l\,\mathrm{SSIM}\!\left(f_l(x_\theta(\epsilon)),\,f_l(\psi(x_\phi(\epsilon)))\right)$ is a feature matching loss, where $f_l$ is the $l$-th intermediate feature map encoded by the encoder $f$, $x_\theta$ is the student model, $x_\phi$ the teacher model, and $\psi$ the ODE sampler. $L_{FM}$ yields favorable results in comparison to an MSE loss.

 

DMD

One-step Diffusion with Distribution Matching Distillation

DMD-1 DMD-2
  1. The Distribution Matching Loss is exactly DI's IKL.

  2. As the distribution of our generated samples changes throughout training, we dynamically adjust the fake diffusion model; this is why an extra diffusion model must be trained. The fake diffusion model and the one-step generator are trained jointly.

 

DMD2

Improved Distribution Matching Distillation for Fast Image Synthesis

DMD2

  1. Multi-step generator (999, 749, 499, 249); as in CM, it alternates between denoising and noise injection steps, e.g. generate $\hat x_0$ directly from $x_{999}$, noise $\hat x_0$ to $x_{749}$, and so on, so during training $G_\theta$'s output is always $\hat x_0$.

  2. To avoid the training/inference mismatch, the training inputs are not noised dataset images but the noisy images produced with $G_\theta$ as above.

  3. Removing the regression loss: true distribution matching and easier large-scale training.

  4. Stabilizing pure distribution matching with a Two Time-scale Update Rule. fake diffusion model和few-step generator是分开训练的。

  5. Surpassing the teacher model using a GAN loss and real data.

 

CM

Consistency Models

Consistency-Models

  1. In the concrete implementation, CM is trained on a 40-step EDM, so $t_{n+1}$ and $t_n$ are adjacent steps among those 40.

  2. For simplicity, we only consider one-step ODE solvers in this work. It is straightforward to generalize our framework to multistep ODE solvers and we leave it as future work.

  3. CM's training target is the consistency between two adjacent steps, not the diffusion model's objective of reconstructing $x_0$, so $x_{t_{n+1}}$ is sampled directly from $x_0$, while $x_{t_n}$ must be solved from $x_{t_{n+1}}$: the two must map to the same output (which need not be $x_0$). Sampling $x_{t_{n+1}}$ directly is valid because the PF ODE is derived precisely from sharing the SDE's marginal distributions. Only when training samples $t_n=\epsilon$ does the target become reconstructing $x_0$, so the generation ability is determined by the $t_n=\epsilon$ step and propagates in a chain to the later timesteps.
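One consistency-distillation update, as described in point 3, can be sketched as below (a sketch under assumptions: EDM-style noising with $\sigma=t$, plain MSE as the distance instead of LPIPS, and stand-in callables for the online model `f_theta`, the EMA target `f_ema`, and the one-step ODE solver `ode_step`):

```python
import numpy as np

def consistency_distillation_loss(f_theta, f_ema, ode_step, x0, t_next, t_cur, rng):
    """Consistency distillation sketch: noise x0 to x_{t_{n+1}}, take one
    ODE-solver step back to x_{t_n}, and match the two model outputs."""
    eps = rng.standard_normal(x0.shape)
    x_next = x0 + t_next * eps                 # sample x_{t_{n+1}} from x0
    x_cur = ode_step(x_next, t_next, t_cur)    # solve one step toward 0
    # online model on x_{t_{n+1}} should agree with EMA target on x_{t_n}
    return np.mean((f_theta(x_next, t_next) - f_ema(x_cur, t_cur)) ** 2)
```

Gradients flow only through `f_theta`; the EMA branch is treated as a fixed target, mirroring the stop-gradient in the paper.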

 

LCM

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

LCM

$\Psi$ is an ODE solver, e.g. DDIM or DPM-Solver.

Since the gap $t_{n+1}\to t_n$ is tiny, $z_{t_n}$ and $z_{t_{n+1}}$ are already close to each other, incurring a small consistency loss and hence leading to slow convergence. Instead of ensuring consistency between adjacent time steps $t_{n+1}\to t_n$, LCMs aim to ensure consistency between the current time step and one $k$ steps away, $t_{n+k}\to t_n$. These $k$ steps are covered by a single ODE-solver step, which amounts to training a CM on a $T/k$-step ODE.

 

LCM-LoRA

LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

Stable Diffusion + LoRA as the Consistency Model.

 

RG-LCD

Reward Guided Latent Consistency Distillation

RG-LCD

 

CTM

Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion

  1. Essentially the same as TCD, but framed differently (TCD copies CTM closely): TCD starts from the latter two steps and then introduces the first, whereas CTM starts from the two endpoint steps and then introduces an intermediate one.

 

GCTM

Generalized Consistency Trajectory Models for Image Manipulation

  1. CTMs only allow translation from Gaussian noise to data. This work aims to unlock the full potential of CTMs by proposing generalized CTMs, which translate between arbitrary distributions via ODEs.

  2. Flow Matching is another technique for learning PF ODEs between two distributions; CTMs are applied on the PF ODEs learned by Flow Matching.

  3. Supports translation, editing, etc.

 

TCD

Trajectory Consistency Distillation

TCD-1

TCD-2

  1. Defines $f_\theta(x_t,t,s)\to x_s$ instead of mapping directly to $x_0$, which amounts to training the model to keep the consistency $t_{n+k}\to t_n\to t_m$.

  2. Left: CM's multi-step sampling; right: TCD's. Each CM sampling step predicts all the way to $x_0$ and re-noises, so errors are large and accumulate; each TCD sampling step predicts only to some intermediate step, with smaller error.

  3. LoRA fine-tunes SDXL.

 

TSCD

Hyper-SD: Trajectory Segmented Consistency Model for Efficient Image Synthesis

TSCD

  1. Combines PD with TCD, calling it Progressive Consistency Distillation. First split $[0, T]$ into $k$ segments; during TCD training, restrict TCD's $m$ to the segment that the pair $x_t \to x_{t-1}$ (TCD's $x_{t_k}$) lies in; once training finishes, halve $k$ and continue, until $k=1$, at which point it is equivalent to plain TCD.

  2. Consistency distillation uses an adversarial loss together with an MSE loss. Empirically, we observe that MSE Loss is more effective when the predictions and target values are proximate (e.g., for k=8,4), whereas adversarial loss proves more precise as the divergence between predictions and targets increases (e.g., for k=2,1).

  3. After training, DMD is applied for further enhancement.

  4. Fine-tunes SDXL with LoRA.

 

MCM

Multistep Consistency Models

Similar in spirit to TCD, but MCM does not redefine $f$; instead it uses $f$'s prediction to compute the result at some intermediate step (DDIM's $\hat{x}_0$).

 

SCott

SCott: Accelerating Diffusion Models with Stochastic Consistency Distillation

SCott

  1. Uses an SDE solver rather than an ODE solver.

  2. Uses a multi-step SDE.

  3. The Consistency Model is parameterized to predict a mean and a variance, so its output is a distribution, optimized with a KL divergence.

 

SiD

Score identity Distillation: Exponentially Fast Distillation of Pretrained Diffusion Models for One-Step Generation

 

SiD-LSG

Long and Short Guidance in Score identity Distillation for One-Step Text-to-Image Generation

 

ODE-based

Solvers can be divided into single-step and multi-step. Single-step solvers predict the next state from the current state only, e.g. DDIM, EDM, DPM-Solver; they are simple to implement and self-starting. Multi-step solvers additionally require past states to predict the next state, e.g. PNDM, DEIS; they give more accurate estimates and better results.

 

DDIM

Denoising Diffusion Implicit Models

 

PNDM

Pseudo Numerical Methods for Diffusion Models on Manifolds

 

DPM-Solver

DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

 

ODE-Distillation-based

ODE-Distillation

Distilling ODE Solvers of Diffusion Models into Smaller Steps

The UNet outputs at neighboring timesteps are highly correlated and similar; ODE-based acceleration essentially exploits this redundancy in the UNet outputs, extrapolating the next output from a combination of past outputs.

Existing ODE methods, such as linear multistep methods, all use fixed formulas for combining past outputs.

This method distills existing ODE solvers further by optimizing learnable combination coefficients over past outputs, reducing the number of sampling steps.

 

Caching

DeepCache

DeepCache: Accelerating Diffusion Models for Free

DeepCache

  1. Run a full inference pass every $N$ steps and cache the features; the following $N-1$ steps compute with the cached features, so only $T/N$ full inference passes are needed in total.
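That caching schedule might be sketched as follows (the names are illustrative, not DeepCache's API):

```python
def deepcache_plan(total_steps, interval):
    """Per sampling step, decide whether to run a full UNet pass (which also
    refreshes the feature cache) or a cheap pass reusing cached deep features."""
    return ["full" if step % interval == 0 else "cached" for step in range(total_steps)]
```

For example, `deepcache_plan(50, 5)` performs only 10 full passes over 50 sampling steps.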

 

Unraveling

Unraveling the Temporal Dynamics of the Unet in Diffusion Models

Unraveling

 

Faster-Diffusion

Faster Diffusion: Rethinking the Role of UNet Encoder in Diffusion Models

  1. The encoder features exhibit a subtle variation at adjacent time-steps, whereas the decoder features exhibit substantial variations across different timesteps. Hence the UNet encoder's outputs and features from previous steps can be reused and fed / skip-connected directly into the next step's UNet decoder.

  2. The encoder feature change is larger in the initial inference phase compared to the later phases throughout the inference process, so the reuse is concentrated in the middle and late stages of sampling.

  3. Reuse can also span several consecutive steps, so those steps can be computed in parallel.

 

BlockCaching

Cache Me if You Can: Accelerating Diffusion Models through Block Caching

BlockCaching

  1. UNet的block输出具有三个特点:smooth change over time, distinct patterns of change, small step-to-step difference. A lot of blocks are performing redundant computations during steps where their outputs change very little. Instead of computing new outputs at every step, we reuse the cached outputs from a previous step. Due to the nature of residual connections, we can perform caching at a per block level without interfering with the flow of information through the network otherwise.

  2. Reuses the outputs of certain blocks from previous timesteps to cut computation.

 

Δ-DiT

Delta-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

Delta-DiT

  1. Note the scissors in the figure: the methods differ in which computations they cut.

  2. 和之前的方法不同的是,Δ-Cache caches the difference between feature maps.

  3. Δ-Cache is applied to the back blocks in the DiT during the early outline generation stage of the diffusion model, and on front blocks during the detail generation stage.

 

TGATE

Cross-Attention Makes Inference Cumbersome in Text-to-Image Diffusion Models

  1. This study reveals that, in text-to-image diffusion models, cross-attention is crucial only in the early inference steps, allowing us to cache and reuse the cross-attention map in later steps.

  2. Saves the computation of the cross-attention map, the most computation-heavy part.

 

L2C

Learning-to-Cache: Accelerating Diffusion Transformer via Layer Caching

L2C

  1. Trains $T \times D$ real numbers, where $T$ is the number of timesteps and $D$ the number of DiT layers; MHSA and FeedForward each count as one layer.

  2. During training, sample two adjacent timesteps $s$ and $m$, sample $x_s$, compute $\epsilon_\theta(x_s, s)$, and cache every layer's output; then solve $x_m$ from $x_s$ with the ODE and compute $\epsilon_\theta(x_m, m)$ as the ground truth; then compute $\tilde{\epsilon}_\theta(x_m, m)$, where each DiT layer is computed as $h_{i+1}^m = h_i^m + g(m)\left(\beta_{m,i} f_i(h_i^m) + (1-\beta_{m,i}) f_i(h_i^s)\right)$, with $f_i$ an MHSA or FeedForward layer, $h_i$ the current layer's input, and $g(m)$ the DiT scale coefficient; optimize $\beta_m$ with $\left\|\epsilon_\theta(x_m, m) - \tilde{\epsilon}_\theta(x_m, m)\right\|_2^2$.

  3. At inference, whenever some layer's $\beta_{t,i}$ is below a threshold, set $\beta_{t,i}=0$; the coefficient of $f_i(h_i^t)$ then becomes 0, so that layer's computation can be skipped.

 

Task-Oriented

Task-Oriented Diffusion Model Compression

  1. Acceleration specifically for image-to-image translation tasks, e.g. InstructPix2Pix image editing and StableSR image restoration.

  2. Depth-skip compression: the same as (b) removing deconv blocks in Unraveling.

  3. Timestep optimization: biased timestep selection.

 

others

DG

Refining Generative Process with Discriminator Guidance in Score-Based Diffusion Models

  1. Generate a batch of samples with the pretrained diffusion model, recording the intermediate $x_t$ along the way.

  2. Sample a batch from the real dataset and noise it to a random timestep; sample a batch from the generated set and take the $x_t$ at the same timestep; train a time-dependent discriminator $D(x_t, t)$ to tell them apart.

  3. At sampling time, guide with $\nabla_{x_t} \log\frac{D(x_t, t)}{1 - D(x_t, t)}$.
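The guidance term in item 3 can be illustrated with a toy, finite-difference version (the `disc` callable and the numeric gradient are assumptions made for the sketch; in practice the gradient comes from autodiff):

```python
import numpy as np

def dg_guidance(x, t, disc, eps=1e-4):
    """Discriminator guidance: numerical gradient of log(D/(1-D)) w.r.t. x_t,
    where disc(x, t) is a time-dependent discriminator returning a probability."""
    def log_ratio(v):
        d = disc(v, t)
        return np.log(d) - np.log1p(-d)
    g = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        xp, xm = x.copy(), x.copy()
        xp.flat[i] += eps
        xm.flat[i] -= eps
        g.flat[i] = (log_ratio(xp) - log_ratio(xm)) / (2.0 * eps)
    return g
```

For a logistic discriminator $D(x) = \sigma(w\cdot x)$, $\log\frac{D}{1-D} = w\cdot x$, so the guidance equals $w$.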

 

DiffRS

Diffusion Rejection Sampling

DiffRS

  1. There is a gap between $p_\theta(x_{t-1}|x_t)$ and $q(x_{t-1}|x_t)$, and $q(x_{t-1}|x_t)$ is not necessarily Gaussian.

  2. Uses rejection sampling to refine each sampling step: treat $p_\theta(x_{t-1}|x_t)$ as the proposal distribution, draw from it and reject with some probability, repeating until a sample is accepted.

  3. The acceptance probability is ultimately computed with DG's time-dependent discriminator.

 

AMS

Score-based Generative Models with Adaptive Momentum

  1. Similar to FDM but requires no retraining: motivated by the Stochastic Gradient Descent (SGD) optimization methods and the high connection between the model sampling process with the SGD, we propose adaptive momentum sampling to accelerate the transforming process without introducing additional hyperparameters.

 

Skip-Tuning

The Surprising Effectiveness of Skip-Tuning in Diffusion Sampling

Skip-Tuning

  1. Let the UNet encoder features be $d_i$ and the decoder features $u_i$; the decoder consumes their concatenation $(d_i, u_i)$. Computing $\|d_i\|_2 / \|u_i\|_2$ for the original model and for a model distilled from it (e.g. a CM) shows that the ratio drops in the distilled model. This suggests: when doing accelerated sampling with the original model, would introducing a coefficient $\rho_i < 1$ and feeding the decoder $(\rho_i d_i, u_i)$ improve sample quality?

  2. Using the simplest strategy, define $\rho_{bottom}$ for the innermost layer and $\rho_{top}$ for the outermost, with $\rho_{bottom} < \rho_{top}$, and interpolate $\rho_i$ for the layers in between. With 5-step Heun sampling on EDM, $\rho_{bottom}=0.55$ and $\rho_{top}=1.0$ cut FID to about half of the original value, a huge improvement.

  3. Take some images, noise them and denoise in one step, and sum the diffusion loss over all steps: skip-tuning does not lower the diffusion loss, but it does bring the one-step denoised images closer to the originals in feature space (as extracted by InceptionV3, CLIP, etc.). So skip-tuning improves FID by optimizing features. One can therefore take an existing model, add a trainable $\rho$, and train it with a diffusion loss plus a feature loss to obtain the same improvement.
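Item 2's simplest strategy can be sketched as follows (the function names and the linear spacing are assumptions consistent with the description above):

```python
import numpy as np

def skip_tuning_scales(num_levels, rho_bottom=0.55, rho_top=1.0):
    """Per-level skip scales, linearly interpolated from the innermost (bottom)
    UNet level to the outermost (top) one."""
    return np.linspace(rho_bottom, rho_top, num_levels)

def scaled_skip(d_i, u_i, rho_i):
    """The decoder consumes (rho_i * d_i, u_i) instead of (d_i, u_i)."""
    return np.concatenate([rho_i * d_i, u_i], axis=0)
```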

 

MASF

Boosting Diffusion Models with Moving Average Sampling in Frequency Domain

MASF-1

MASF-2

  1. Views the diffusion generation process as a parameter-optimization process, so a moving average can be introduced to improve stability and quality.

  2. The denoising process often prioritizes reconstructing low-frequency component (layout) in the earlier stage, and then focuses on the recovery of high-frequency component (detail) later. Therefore, during IDWT, each component is multiplied by a coefficient: the low-frequency component by a monotonically decreasing one and the high-frequency component by a monotonically increasing one.

  3. At the same number of steps, its FID beats DDIM's.

 

Exposure Bias

TS-DDPM

Alleviating Exposure Bias in Diffusion Models through Sampling with Shifted Time Steps

Exposure bias: for a given timestep $t$, the distribution of the $x_t$ fed to the network during training differs from that of the $x_t$ obtained during sampling; in other words, there is a domain shift.

We search for such a time step within a window surrounding the current time step to restrict the denoising progress.

 

IP

Input Perturbation Reduces Exposure Bias in Diffusion Models

Models the gap between the training-time and sampling-time distributions of $x_t$ with a Gaussian.

 

DREAM

DREAM: Diffusion Rectification and Estimation-Adaptive Models

DREAM

 

SS

Markup-to-Image Diffusion Models with Scheduled Sampling

  1. When training the diffusion model, first sample an $x_{t+m}$ from $q(x_{t+m}|x_0)$, then use the diffusion model itself to sample $x_t$, and train $\epsilon_\theta(x_t, t)$ to predict $\frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{\sqrt{1 - \bar{\alpha}_t}}$; for simplicity, the gradients produced during sampling are ignored.

  2. This method was originally used to address the exposure-bias problem in autoregressive text generation.

 

MDSS

Multi-Step Denoising Scheduled Sampling: Towards Alleviating Exposure Bias for Diffusion Models

MDSS

 

others

Manifold Constraint

Manifold-Guided Sampling in Diffusion Models for Unbiased Image Generation

encourage the generated images to be uniformly distributed on the data manifold, without changing the model architecture or requiring labels or retraining.

Implemented via guidance.

 

BayesDiff

BayesDiff: Estimating Pixel-wise Uncertainty in Diffusion via Bayesian Inference

Uses the Last-layer Laplace Approximation (LLLA) to estimate the uncertainty of samples generated by a diffusion model, which can indicate the level of clutter and the degree of subject prominence in the image. High-uncertainty samples have cluttered backgrounds and can be filtered out.

 

CADS

CADS: Unleashing the Diversity of Diffusion Models through Condition-Annealed Sampling

  1. Causes of low diversity: the model itself was trained on a small dataset; the CFG scale is too large.

  2. Corrupt the condition $y$ fed to the model: $\hat{y} = \sqrt{\gamma(t)}\, y + s\sqrt{1-\gamma(t)}\, n$, $\hat{y}_{rescaled} = \frac{\hat{y} - \text{mean}(\hat{y})}{\text{std}(\hat{y})}\cdot\text{std}(y) + \text{mean}(y)$, $\hat{y}_{final} = \psi\, \hat{y}_{rescaled} + (1-\psi)\, \hat{y}$, where $\gamma(t)$ is piecewise: 0 on $[t_2, T]$, linear from 0 to 1 on $[t_1, t_2]$, and 1 on $[0, t_1]$. the diffusion model initially only follows the unconditional score and ignores the condition. As we reduce the noise, the influence of the conditional term increases. This progression ensures more exploration of the space in the early stages and results in high-quality samples with improved diversity.

  3. For a class-conditional diffusion model, $y$ is the class embedding; for StableDiffusion, the text embedding; for an image-conditional diffusion model, the image condition.

 

FPDM

Fixed Point Diffusion Models

FPDM

No new theory; it simply replaces the large middle network of the DiT block with a smaller implicit model that solves for the fixed point of $x = f^\theta_{fp}(x, x_{input}, t)$, where $x_{input}$ is the output of $f_{pre}$; training uses the Jacobian-Free Backpropagation algorithm to compute the gradients of $f^\theta_{fp}$.

The number of fixed-point iterations can be adjusted dynamically according to accuracy or latency requirements.

 

DistriFusion

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Distributed inference.

 

LCSC

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

LCSC

  1. Evolutionary Search

  2. An effect similar to ensemble learning.

 

Guidance

Classifier

Noisy Classifier

Diffusion Models Beat GANs on Image Synthesis

Must be trained on $x_t$, obtaining the exact gradient on $x_t$.

 

Classifier-Free

Classifier-Free Diffusion Guidance

 

Denoising-Assisted

Training Diffusion Classifiers with Denoising Assistance

When training the noisy classifier, also condition on the $\hat{x}_0$ predicted by a pre-trained diffusion model; the guidance works better.

 

Any Distance Estimator

The recent focus of the conditional diffusion researches is how to incorporate the conditioning gradient during the reverse sampling. This is because for a given loss function l(x), a direct injection of the gradient of the loss computed at xt produces inaccurate gradient guidance.

Use Tweedie's formula to compute $\hat{x}_0$ from $x_t$ and $\epsilon_\theta(x_t, t)$, feed it into $l(\hat{x}_0)$, and use the gradient with respect to $x_t$ as guidance.

 

DPS

Diffusion Posterior Sampling for General Noisy Inverse Problems

  1. In inverse problems, $y = Hx + \epsilon$. Generate with a pretrained unconditional diffusion model, using $\nabla_{x_t}\left\|y - H\hat{x}_0\right\|$ as guidance at each step.

 

MCG

Improving Diffusion Models for Inverse Problems using Manifold Constraints

 

FreeDoM

FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model

Training a distance function between noisy data and the condition and using its gradient for guidance is too expensive. Instead, use the noise predicted at each step to compute the predicted clean data and reuse an existing distance function between clean data and the condition, i.e.:

$D_\phi(c, x_t, t) \approx \mathbb{E}_{p(x_0|x_t)} D_\theta(c, x_0)$

This practice is common, but its results are unstable: it works well for small domains (e.g. faces) but badly for large domains (ImageNet). The reason: The direction of unconditional score generated by diffusion models in large data domains has more freedom, making it easier to deviate from the direction of conditional control.

Solution: use RePaint's resample technique, looping $x_t \xrightarrow{\text{guidance}} x_{t-1} \xrightarrow{\text{diffuse}} x_t$; this applies guidance multiple times within each sampling step, and every new $x_t$ is more informative, aligned, and harmonized than the one before.

 

UGC

Universal Guidance for Diffusion Models

 

LGD

Loss-Guided Diffusion Models for Plug-and-Play Controllable Generation

 

MPGD

Manifold Preserving Guided Diffusion

 

FIGD

Fisher Information Improved Training-Free Conditional Diffusion Model

 

TFG

Understanding Training-free Diffusion Guidance: Mechanisms and Limitations

Training-Free-Guidance

Proposes two improvement strategies.

 

EluCD

Elucidating The Design Space of Classifier-Guided Diffusion Generation

A calibration approach, though it only works with off-the-shelf discrete classifiers.

EluCD

 

PnP

Diffusion Models as Plug-and-Play Priors

Variational Inference

The sampling process is a point-estimate sampling of the introduced variational distribution, and also a minimization of the negative ELBO, i.e. of the KL divergence between the variational distribution and the true posterior.

Plug-and-Play

 

Steered-Diffusion

Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

$p_\theta(x_{t-1}|x_t, c) \propto p_\theta(x_{t-1}|x_t)\,\frac{p_\theta(c|x_{t-1})}{p_\theta(c|x_t)}$

$\nabla_{x_t}\log p_\theta(x_{t-1}|x_t, c) = \nabla_{x_t}\log p_\theta(x_{t-1}|x_t) - \nabla_{x_t}V_1(x_t, c) + \nabla_{x_t}V_2(x_{t-1}, c)$

$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) - \nabla_{x_t}V_1(x_t, c) + \nabla_{x_t}V_2(x_{t-1}, c) + \sigma_t\epsilon$

 

DSG

Guidance with Spherical Gaussian Constraint for Conditional Diffusion

DSG

 

DreamGuider

DreamGuider: Improved Training free Diffusion-based Conditional Generation

DreamGuider

  1. The gradient need not be backpropagated through the diffusion network, cutting computation.

  2. Inspired by the SGD algorithm, it uses a dynamic scale, avoiding handcrafted parameter tuning on a case-by-case basis.

 

AutoGuidance

Guiding a Diffusion Model with a Bad Version of Itself

  1. Guiding a high-quality model with a poor model trained on the same task, conditioning, and data distribution, but suffering from certain additional degradations, such as low capacity and/or under-training.

  2. $D_0(x_t, t, c) + \omega\left[D_1(x_t, t, c) - D_0(x_t, t, c)\right]$, where $D_1(x_t, t, c)$ is the properly trained model and $D_0(x_t, t, c)$ is an under-trained model or one with far fewer parameters.

 

Asymmetric Reverse Process

The $P_t$ and $D_t$ in the DDIM reverse process are made asymmetric.

Asyrp

Diffusion Models Already Have a Semantic Latent Space

Optimizes the model according to $l$ and outputs a specific $P_t$.

 

Decomposed Diffusion Sampling

Fast Diffusion Sampler for Inverse Problems by Geometric Decomposition

Use Tweedie's formula to compute $\hat{x}_0$ from $x_t$ and $\epsilon_\theta(x_t, t)$, directly optimize a $\Delta x_0$ that makes $l(\hat{x}_0 + \Delta x_0)$ as small as possible, and use $\hat{x}_0 + \Delta x_0$ as $P_t$.

Can be viewed as equivalent to Asyrp, just implemented differently.

 

Asymmetric Gradient Guidance

Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance

Guidance

 

Inversion

why

  1. A generative model goes from a latent variable (usually randomly sampled noise) to a generated sample; Inversion starts from real data (not generated) and finds the latent that generates exactly it.

  2. The motivation is editing real data.

 

GAN Inversion

  1. Due to mode collapse, GAN inversion works relatively poorly and the procedure is rather involved.

 

DDIM Inversion

  1. The DDIM generation process can be written as $a_t = \sqrt{\bar{\alpha}_{t-1}/\bar{\alpha}_t}$, $b_t = \sqrt{1-\bar{\alpha}_{t-1}} - \sqrt{\bar{\alpha}_{t-1}(1-\bar{\alpha}_t)/\bar{\alpha}_t}$, $\epsilon(x_t, t) = \epsilon_\theta(x_t, t, \phi) + \omega\left[\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \phi)\right]$, $x_{t-1} = a_t x_t + b_t\,\epsilon(x_t, t)$. This is not invertible: from $x_{t-1}$ alone, $x_t$ cannot be recovered analytically. Under the approximation $\epsilon(x_t, t) \approx \epsilon(x_{t-1}, t-1)$, it becomes approximately invertible: $x_{t-1} = a_t x_t + b_t\,\epsilon(x_{t-1}, t-1) \Rightarrow x_t = \frac{x_{t-1} - b_t\,\epsilon(x_{t-1}, t-1)}{a_t}$, which is the formula used in DDIM Inversion (the reverse ODE).

  2. For unconditional ($\omega=0$) and ordinary conditional ($\omega=1$) models, the approximation is fairly accurate: $x_0 \xrightarrow{\text{DDIM Inversion}} x_t \xrightarrow{\text{DDIM}} \hat{x}_0$ reconstructs the image almost perfectly. But with large-scale classifier guidance ($\omega>1$) the approximation error is large and reconstruction degrades badly, especially with few steps and large step sizes. With 50-step generation — top half: generate an image from a prompt with $\omega=7$; encoding then decoding with $\omega=0$ reconstructs well, while with $\omega=7$ it reconstructs poorly. Bottom half: write a prompt for a real image; encoding then decoding with $\omega=0$ reconstructs well, while with $\omega=7$ it reconstructs poorly. On the right, the cosine-similarity curve between $\epsilon(x_t, t)$ and $\epsilon(x_{t-1}, t-1)$ shows how well the approximation above holds.

EDICT

  1. If an asymmetric $\omega$ is used, i.e. a different $\omega$ for $x_0 \xrightarrow{\text{DDIM Inversion}} x_t$ than for $x_t \xrightarrow{\text{DDIM}} \hat{x}_0$, the inversion's approximation error is amplified: when ω of the sampling process is different from that of the forward process, the accumulated error would be amplified, leading to unsatisfactory reconstruction quality. Reconstructions with $\omega_{enc}=0$ and various $\omega_{dec}$:

Prompt-Tuning

  1. Grid search (higher PSNR is better) shows that in each row, only $\omega_{dec} = \omega_{enc}$ achieves the best reconstruction for that $\omega_{enc}$. If a large $\omega_{dec}$ must be used, a small $\omega_{enc}$ is best.

Prompt-Tuning-2

  1. In DiffusionAutoencoder, using the inferred $x_T$ that controls the stochastic changes does not reconstruct the original image exactly, precisely because DDIM is not invertible. Likewise, for the inferred $x_T$, encoding with 100 steps often reconstructs worse than encoding with 1000 steps, because at 1000 steps the approximation $\epsilon(x_t, t) \approx \epsilon(x_{t-1}, t-1)$ is more accurate, while at 100 steps its error is relatively large.

  2. With good inversion, editing becomes easy. Some works focus specifically on exact inversion, e.g. EDICT, Null-text Inversion, Prompt Tuning, AIDI; see the Image Editing section.
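The (approximate) inversion loop in item 1 can be sketched as below. With a noise predictor that is constant in both arguments, the approximation $\epsilon(x_t,t)\approx\epsilon(x_{t-1},t-1)$ is exact and the round trip reconstructs $x_0$; the `abar` schedule and `eps_fn` here are toy assumptions:

```python
import numpy as np

def ddim_coeffs(abar, t):
    """a_t, b_t from the cumulative alphas (abar[0] = 1)."""
    a = np.sqrt(abar[t - 1] / abar[t])
    b = np.sqrt(1.0 - abar[t - 1]) - np.sqrt(abar[t - 1] * (1.0 - abar[t]) / abar[t])
    return a, b

def ddim_sample_step(x_t, t, abar, eps_fn):
    a, b = ddim_coeffs(abar, t)
    return a * x_t + b * eps_fn(x_t, t)              # x_{t-1} = a_t x_t + b_t eps(x_t, t)

def ddim_invert_step(x_prev, t, abar, eps_fn):
    a, b = ddim_coeffs(abar, t)
    # approximate eps(x_t, t) by eps(x_{t-1}, t-1), then solve for x_t
    return (x_prev - b * eps_fn(x_prev, t - 1)) / a
```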

 

Regularized DDIM Inversion

Zero-shot Image-to-Image Translation

During DDIM inversion, each step refines the prediction of $\epsilon_\theta$ by gradient descent on two losses: one over the correlations between different positions, the other the per-position KL divergence to a standard Gaussian.

 

Exact Inversion

Effective Real Image Editing with Accelerated Iterative Diffusion Inversion

Fixed-point iteration.

 

On Exact Inversion of DPM-Solvers

Inversion for higher-order samplers.

 

 

Parameter-Efficient Fine-Tuning

Applicable to different tasks, e.g. data-driven fine-tuning, RLHF fine-tuning, TI fine-tuning.

 

LoRA

LoRA: Low-rank adaptation of large language models

$W = W_o + BA$, $h = h_o + \Delta h = W_o x + BAx$
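A minimal sketch of that update (the names are generic, not from the paper's code):

```python
import numpy as np

def lora_forward(x, W0, A, B):
    """h = W0 x + B A x. With rank r, A is (r, d_in) and B is (d_out, r);
    B is initialized to zero so the adapter starts as a no-op."""
    return W0 @ x + B @ (A @ x)
```

Only `A` and `B` are trained; the frozen `W0` is untouched, and `B A` can be merged into `W0` after training.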

 

AttnLoRA

Simple Drop-in LoRA Conditioning on Attention Layers Will Improve Your Diffusion Model

AttnLoRA

  1. The standard U-Net architecture for diffusion models conditions convolutional layers in residual blocks with scale-and-shift but does not condition attention blocks. Simply adding LoRA conditioning on attention layers improves the image generation quality.

 

TriLoRA

TriLoRA: Integrating SVD for Advanced Style Personalization in Text-to-Image Generation

  1. Compact SVD: for $A \in \mathbb{R}^{m\times n}$, $A = U_r\Sigma_r V^T$, where $U_r \in \mathbb{R}^{m\times r}$ and $V \in \mathbb{R}^{n\times r}$ are reduced orthonormal matrices and $\Sigma_r \in \mathbb{R}^{r\times r}$ is the diagonal matrix of the $r$ largest singular values.

  2. TriLoRA: $W = W_o + U_r\Sigma_r V^T$, $h = h_o + \Delta h = W_o x + U\Sigma V^T x$; three matrices are learned.

 

SVDiff

SVDiff: Compact Parameter Space for Diffusion Fine-Tuning

  1. Similar to FSGAN: apply SVD to the convolution weights, $W = U\Sigma V^T$ with $\Sigma = \text{diag}(\sigma)$. Fine-tuning learns only an offset of the singular values, giving $\Sigma_\delta = \text{diag}(\text{ReLU}(\sigma + \delta))$, and the final layer becomes $W_\delta = U\Sigma_\delta V^T$. With so few trainable parameters, overfitting is less likely.
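A sketch of that parameterization (illustrative, using NumPy's SVD):

```python
import numpy as np

def svdiff_weight(W, delta):
    """Learn only a shift on the singular values:
    W_delta = U diag(ReLU(sigma + delta)) V^T."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(sigma + delta, 0.0)) @ Vt
```

With `delta = 0` the original weight is recovered, so fine-tuning starts from the pretrained layer.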

 

PET

A Closer Look at Parameter-Efficient Tuning in Diffusion Models

PET

  1. Adds small trainable adapters to pretrained StableDiffusion for transfer learning, exploring how adapter placement and architecture affect training.

 

StyleInject

StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models

StyleInject

  1. Improves on LoRA, whose update is $W = W_o + BA$, $h = h_o + \Delta h = W_o x + BAx$.

 

LyCORIS

Navigating Text-To-Image Customization From LyCORIS Fine-Tuning to Model Evaluation

LyCORIS

 

OFT

Controlling Text-to-Image Diffusion by Orthogonal Finetuning

  1. $z = W^T x = (RW_0)^T x \ \ \text{s.t.}\ \ R^T R = RR^T = I,\ \|R - I\| \le \epsilon$; only $R$ is optimized.

  2. OFT is an alternative fine-tuning method: better results than LoRA, fewer parameters, faster convergence.

  3. Unrelated to OrthoAdaptation.

 

BOFT

Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization

  1. Improves OFT.

 

SODA

Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models

PEFT-SODA

 

SCEdit

SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing

SCEdit

  1. Uses only the SC-Tuner.

 

DiffFit

DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-Efficient Fine-Tuning

DiffFit

  1. A PEFT method for DiT.

  2. Also supports fine-tuning a low-resolution model to high resolution by interpolating the positional embeddings; e.g. when doubling the resolution, the original $(i, j)$ becomes $(i/2, j/2)$.

 

Diffscaler

Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Diffscaler

  1. DiT for incremental class-conditional generation.

  2. Adds class embeddings for the incremental classes.

  3. Affiner: $Wx + \hat{b} \to (1+a)Wx + \hat{b} + b + s\,W_{up}\,\text{ReLU}(W_{down}x)$. For transformer models, we add our Affiner block for each key, query, value weights and bias parameters as well as the MLP block.

 

Text-to-Image

Awesome

VQ-Diffusion

Vector Quantized Diffusion Model for Text-to-Image Synthesis

VQVAE + multinomial diffusion

transformer blocks: input $x_t$, cross-attention with text, NAR predicting $x_{t-1}$

VQ-Diffusion

 

GLIDE

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  1. First proposed the text → 64×64 → 256×256 generation pipeline.

text → 64×64: train a TransformerEncoder to encode the text (of length $K$), producing $K$ vectors ($K\times d$); two conditioning methods are used together:

First: the last vector substitutes for the class embedding in ADM's AdaGN.

Second: the $K$ vectors are also injected into the attention modules throughout the UNet. Concretely, each UNet AttentionBlock trains an extra 1-D convolution ($d \to 2d_c$) that maps the $K$ vectors to shape $2\times d_c\times K$ (the textual KV); the current $x_t$'s feature map ($d_c\times h\times h$) is mapped to $3\times d_c\times h^2$ (the visual QKV); $d_c$ is then split into n_heads groups, the visual and textual KV are concatenated along the length dimension, and MultiheadAttention is applied, yielding an attention map of size $h^2\times(h^2+K)$. This is effectively a hybrid attention combining a self-attention ($h^2\times h^2$) with a cross-attention ($h^2\times K$).

attention

64×64 → 256×256: uses ADM's super-resolution approach with the same conditioning methods as above, but with a lower-dimensional TransformerEncoder for the text.

  2. classifier-free guidance

After the conditional model above is trained, fine-tune it with the text replaced by an empty string 20% of the time, yielding the classifier-free model.

  3. Text-Guided Inpainting Model

Inpainting with a pretrained DDPM replaces the unmasked part of $x_t$ with a sample from $q(x_t|x_0)$ at each sampling step. But then the model never sees the complete information of the unmasked region, only a noisy version of it, which causes unnatural results at the mask boundary.

After the conditional model above is trained, randomly mask part of $x_0$ to get a masked $x_0$; concatenate the masked $x_0$ and the mask (an RGB image plus a mask, four channels in total) onto the unmasked $x_t$, feed this to the UNet, compute the loss only on the masked region, and fine-tune to obtain an inpainting model. Only the input-channel count of the first Conv layer grows; everything else is unchanged.

At sampling time, as above, each step replaces the unmasked part of $x_t$ with a sample from $q(x_t|x_0)$.

 

DALLE-2 (unCLIP)

Hierarchical Text-Conditional Image Generation with CLIP Latents

  1. text → 64×64 → 256×256 → 1024×1024. The two diffusion upsampler models are no longer conditioned on text; the low-resolution image, after some data augmentation, is concatenated onto $x_t$ (Gaussian Blur for 64×64 → 256×256, Diverse BSR for 256×256 → 1024×1024). To reduce training compute and improve numerical stability, we train upsamplers on random crops of images that are one-fourth the target size. We use only spatial convolutions in the model (i.e., no attention layers) and at inference time directly apply the model at the target resolution, observing that it readily generalizes to the higher resolution.

  2. Prior: a DDPM modeling the image CLIP embedding conditioned on the text. Encode the text with GLIDE's text encoder; encode text and image with a pretrained CLIP; use a TransformerDecoder, feeding in order the encoded text, the CLIP text embedding, the timestep embedding, the noised CLIP image embedding, and a placeholder embedding, with a causal attention mask (each position attends only to earlier ones); the output at the placeholder position predicts the unnoised CLIP image embedding. Instead of $\epsilon$-prediction it uses $z_i$-prediction, optimized with MSE. At sampling time, draw two $z_i$ and keep the one with the larger dot product with $z_t$.

  3. Decoder: a DDPM modeling the image conditioned on the image CLIP embedding and the text. Uses GLIDE's two conditioning methods: first, project the CLIP image embedding to the required dimension and substitute it for ADM's AdaGN class embedding; second, project the CLIP image embedding into a sequence of 4 tokens concatenated after the encoded text token sequence ($K+4$), then use hybrid attention as in GLIDE.

  4. CFG: Prior: randomly dropping text conditioning information 10% of the time during training. Decoder: randomly setting the CLIP embeddings to zero (or a learned embedding) 10% of the time, and randomly dropping the text caption 50% of the time during training.

 

DALL·E-3

Improving Image Generation with Better Captions

Existing text-to-image models struggle to follow detailed image descriptions and often ignore words or confuse the meaning of prompts. We hypothesize that this issue stems from noisy and inaccurate image captions in the training dataset. We address this by training a bespoke image captioner and use it to recaption the training dataset. We then train several text-to-image models and find that training on these synthetic captions reliably improves prompt following ability.

Trains StableDiffusion on the recaptioned dataset.

 

CogView3

CogView3: Finer and Faster Text-to-Image via Relay Diffusion

CogView3

  1. Like DALL·E-3, trains on a recaptioned dataset.

  2. The Base Stage is an EDM StableDiffusion at 512×512 with 8× image compression.

  3. The SR Stage is a latent-space RDM (the original RDM works in pixel space), trained only on the timesteps $[0, T_r]$, with the handoff at $T_r$. It does not use blurring diffusion: the 1024×1024 $x$ is downsampled to 512×512 and then upsampled back to 1024×1024 to get $x_L$; encoding both to latent space gives $z$ and $z_L$, and the forward process becomes $\frac{T_r - t}{T_r}z + \frac{t}{T_r}z_L + \sigma\epsilon$. Interpolation thus replaces blurring, likewise to bridge the gap left by direct upsampling.

  4. At sampling time, the Base Stage's output is upsampled to 1024×1024, encoded to latent space, noised, and fed to the RDM for sampling.

 

SD

High-Resolution Image Synthesis with Latent Diffusion Models

 

SDXL

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

  1. Architecture: following SimpleDiffusion's third lesson, the blocks are distributed unevenly. SD is [1,1,1,1], i.e. 4 levels with 1 block each and 3 downsamples; SDXL is [0,2,10], i.e. 3 levels, where the first only reduces dimensionality and the second and third have 2 and 10 blocks, with 2 downsamples. Two text encoders are used, their outputs concatenated. 3× the parameters of the original SD.

  2. Micro-Conditioning Image Size: dataset images vary in size. SD simply discards small images, losing a large share of the data; the alternative of upsampling small images to the target size makes them blurrier than true large images, which makes the model's outputs blurry. SDXL feeds the original image size as a condition, added to the time embedding. Note: the network still outputs images at the target size, but their blurriness is governed by this condition. The image quality clearly increases when conditioning on larger image sizes.

  3. Micro-Conditioning Cropping Parameter: a big problem with SD is that some outputs crop away part of an object, caused by resizing the shorter side to the target size and cropping the longer side during data processing. SDXL feeds the crop position as a condition, added to the time embedding; at inference, inputting $(0,0)$ yields images with fairly complete objects.

SDXL

  4. Multi-Aspect Training: SD uses a fixed output size. After pretraining at the target size, SDXL fine-tunes on images of multiple aspect ratios: define a set of size buckets, assign each image to the nearest bucket, resize the images within a bucket to its size, and sample each training batch from one randomly chosen bucket. At inference, different sizes can then be generated simply by feeding noise of the target size.

  5. Improved VAE autoencoder: batch size 256 (previously 9) + EMA.

  6. Train first at 256×256 (with micro-conditioning), then at 512×512 (with micro-conditioning), then do Multi-Aspect Training at 1024×1024 (buckets centered on 1024×1024, adjusting height and width in steps of 64 while keeping the total pixel count close to 1024×1024).

  7. Refinement Stage: We train a separate LDM in the same latent space, which is specialized on high-quality, high resolution data and employ a noising-denoising process as introduced by SDEdit on the samples from the base model. We follow and specialize this refinement model on the first 200 (discrete) noise scales. During inference, we render latents from the base SDXL, and directly diffuse and denoise them in latent space with the refinement model, using the same text input.

 

SD3

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

SD3

  1. In Conditional Flow Matching, $z_t = a_t z_0 + b_t\epsilon$ with $a_0=1,\,b_0=0,\,a_1=0,\,b_1=1$. Then $\frac{d}{dt}z_t = a_t' z_0 + b_t'\epsilon = a_t'\frac{z_t - b_t\epsilon}{a_t} + b_t'\epsilon = \frac{a_t'}{a_t}z_t - b_t\left(\frac{a_t'}{a_t} - \frac{b_t'}{b_t}\right)\epsilon$, so $L_{CFM} = E_{t,\epsilon}\left\|v_\theta(z_t,t) - \frac{a_t'}{a_t}z_t + b_t\left(\frac{a_t'}{a_t} - \frac{b_t'}{b_t}\right)\epsilon\right\|^2$. Switching to the $\epsilon$-prediction parameterization, $L_{CFM} = E_{t,\epsilon}\left(b_t\left(\frac{a_t'}{a_t} - \frac{b_t'}{b_t}\right)\right)^2\left\|\epsilon_\theta(z_t,t) - \epsilon\right\|^2$. With the SNR $\lambda_t = \log\frac{a_t^2}{b_t^2}$, we have $\lambda_t' = 2\left(\frac{a_t'}{a_t} - \frac{b_t'}{b_t}\right)$, so $L_{CFM} = E_{t,\epsilon}\left(\frac{b_t\lambda_t'}{2}\right)^2\left\|\epsilon_\theta(z_t,t) - \epsilon\right\|^2$. To unify different methods, write $L_\omega = -\frac{1}{2}E_{t,\epsilon}\left[\omega_t\lambda_t'\left\|\epsilon_\theta(z_t,t) - \epsilon\right\|^2\right]$; for CFM, $\omega_t = -\frac{1}{2}\lambda_t' b_t^2$. Different methods (Rectified Flow, DDPM, EDM, PD v-prediction, etc.) can all be viewed as CFM with different $z_t$ and $\omega_t$.

  2. MMDiT architecture: a DiT in latent space with 16 latent channels; since text and image embeddings are conceptually quite different, we use two separate sets of weights for the two modalities.

  3. The distribution from which the SNR (i.e. the timestep) is sampled during training matters a great deal.

  4. Rectified Flow ($z_t = (1-t)z_0 + t\epsilon$, $\omega_t = \frac{t}{1-t}$) formulations generally perform well and compared to other formulations, their performance degrades less when reducing the number of sampling steps.

 

Wuerstchen

StableCascade

Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models

three-stages to reduce computational demands

  1. Stage A: train a VQGAN with 4× downsampling, 1024 → 256.

  2. Semantic Compressor: resize the image from 1024 to 768 and train a network that compresses it to 16×24×24.

  3. Stage B: diffusion models the pre-quantization embedding of the image from Stage A, conditioned on the image's output from the semantic compressor (Wuerstchen additionally conditions on text), which amounts to self-conditioning.

  4. Stage C: diffusion models the image's output from the semantic compressor, conditioned on text.

  5. Generation runs C → B → A.

 

Imagen

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

  1. text → 64×64 → 256×256 → 1024×1024; all three models are conditioned on the text.

text → 64×64: GLIDE's two conditioning methods.

64×64 → 256×256: Efficient UNet. GLIDE's two conditioning methods.

256×256 → 1024×1024: Efficient UNet. Drops self-attention and uses only cross-attention, reducing computation.

Use noise conditioning augmentation for both super resolution models.

  2. All three models use CFG.

  3. Dynamic thresholding (sampling only)

With a large classifier guidance weight, the $\hat{x}_0$ obtained at each step easily goes out of range and is usually just clipped to $(-1,1)$. Imagen does this dynamically: at each step, take the absolute pixel values of $\hat{x}_0$ and sort them, and set $s$ to the value at some percentile (e.g. 80%); if $s>1$, clip all pixel values to $(-s,s)$ and divide by $s$; otherwise clip as usual.

  4. A pure text encoder works better than a text encoder trained jointly on image-text data.
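The dynamic-thresholding step in item 3 above might be sketched as:

```python
import numpy as np

def dynamic_threshold(x0_hat, pct=80.0):
    """Imagen-style dynamic thresholding: set s to a high percentile of |x0_hat|;
    if s > 1, clip to (-s, s) and rescale by s, otherwise clip to (-1, 1)."""
    s = np.percentile(np.abs(x0_hat), pct)
    if s > 1.0:
        return np.clip(x0_hat, -s, s) / s
    return np.clip(x0_hat, -1.0, 1.0)
```

This keeps a few saturated pixels from dominating when large guidance weights push $\hat{x}_0$ out of range.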

 

YaART

YaART: Yet Another ART Rendering Technology

YaART

  1. text → 64×64 → 256×256 → 1024×1024; the first two models are conditioned on the text, the last one is not.

text → 64×64: GLIDE's two conditioning methods.

64×64 → 256×256: only the AdaGN conditioning method.

256×256 → 1024×1024: Efficient UNet, no text conditioning.

  2. fine-tune text → 64×64 and 64×64 → 256×256 with high-quality image-text pairs, fine-tune 256×256 → 1024×1024 with an SR dataset.

  3. RL alignment for text → 64×64.

 

eDiffi

eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

  1. text → 64×64 → 256×256 → 1024×1024, as in Imagen.

  2. Uses both the T5 text encoder and the CLIP text encoder.

  3. Found that different timesteps make use of the text to different degrees.

Proposes model splitting: each sub-model is trained only on a sub-range of noise levels and is called an expert; the final model is an Ensemble of Expert Denoisers.

 

RAPHAEL

RAPHAEL: Text-to-Image Generation via Large Mixture of Diffusion Paths

  1. eDiffi generates with experts over different timesteps; RAPHAEL additionally uses experts over different spatial regions.

  2. Space MoE: threshold the cross-attention map to obtain each word's mask; a router network then picks an expert per word, which generates that word's feature; the per-word features, multiplied by their masks, are averaged to form the output.

 

PixArt-α

PixArt-alpha: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

PixArt-alpha

  1. DiT with cross-attention added to bring in the text.

  2. In the DiT architecture, the AdaLN parameters account for a striking 27% of DiT's parameters, but text-to-image has no class condition, only the timestep, so that many parameters are unnecessary. Hence AdaLN-single: a single MLP outside all blocks predicts global AdaLN parameters (6 of them) from the timestep; each block additionally trains a length-6 AdaLN parameter, which is added to the global AdaLN parameters to obtain that block's final AdaLN parameters, greatly reducing the parameter count.

  3. Three-stage training: initialize from a pretrained class-conditional ImageNet model, which both saves text-to-image training time and is itself easy and cheap to train; train on a dataset with highly aligned, information-dense captions to achieve text-image alignment; like Emu, fine-tune on a small set of high-quality images.

 

PixArt-Σ

PixArt-sigma: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

PixArt-sigma

  1. Trains with higher-resolution images and finer-grained captions.

  2. To reduce cost, self-attention uses KV compression: features within a neighboring $R\times R$ patch are similar and thus redundant, so a convolution layer shrinks K and V. A $2\times 2$ convolution is used, with parameters initialized to $\frac{1}{R^2}$ so that it is initially equivalent to average pooling. Q is left unchanged to preserve information.

  3. Weak-to-Strong Training Strategy: PixArt-α serves as the weak model. Directly swap in the VAE for high-resolution images; use DiffFit's positional-embedding interpolation when switching to high resolution; KV compression can be enabled when training the strong model even though the weak model never used it.

 

GenTron

GenTron: Diffusion Transformers for Image and Video Generation

GenTron

  1. adaLN design yields superior results in terms of the FID, outperforming both cross-attention and in-context conditioning in efficiency for class-based scenarios. However, our observations reveal a limitation of adaLN in handling free-form text conditioning. Cross-attention uniformly excels over adaLN in all evaluated metrics.

 

PanGu-Draw

PanGu-Draw: Advancing Resource-Efficient Text-to-Image Synthesis with Time-Decoupled Training and Reusable Coop-Diffusion

PanGu-Draw-1

PanGu-Draw-2

PanGu-Draw-3

  1. Cascaded Training: the three models at different resolutions are trained separately. Resolution Boost Training: train at low resolution first, then at high resolution.

  2. Time-Decoupled Training: split the timesteps into two phases; the earlier phase mainly generates the layout and the later one refines it. The layout phase needs massive text-image pairs so the model learns diverse concepts. Previous models filter out low-resolution images, but that is unnecessary here: low-resolution images are simply upsampled to high resolution for training, because the layout phase produces $x_{T_{struct}}$, which is noisy anyway, so blurry upsampled images do not hurt. The refinement phase trains at low resolution and samples at high resolution.

  3. Coop Diffusion: diffusion models trained in different latent spaces and at different resolutions can be used for sampling together, converting through image space as the intermediary.

 

ParaDiffusion

Paragraph-to-Image Generation with Information-Enriched Diffusion Model

  1. Targets generation from long texts describing complex scenes.

  2. Trains the t2i model with a decoder-only language model. The advantage is that GPT-style models have demonstrated strong capability, already model long text well, and have abundant training data; the drawback is that pre-trained decoder-only models are weak feature extractors, so adaptation is needed. Efficiently fine-tuning a more powerful decoder-only language model can yield stronger performance in long-text alignment (up to 512 tokens).

ParaDiffusion

 

KNN-Diffusion

KNN-Diffusion Image Generation via Large-Scale Retrieval

Needs no text-image pairs for training: the condition is images, with CLIP as the bridge. During training, use KNN over the CLIP cosine distances between images to find the N images most similar to the training image as the condition. At sampling time, use KNN over the CLIP cosine distances between text and images to find the N images most similar to the query text as the condition.

 

RDM

Retrieval-Augmented Diffusion Models

Trains with the CLIP embeddings of each training sample's k-NN as the condition; at sampling time, the k-NN can be chosen from text, or the text's CLIP embedding can be used directly.

 

Enhancement

Re-Imagen

Re-Imagen: Retrieval-Augmented Text-to-Image Generator

Though state-of-the-art models can generate high-quality images of common entities, they often have difficulty generating images of uncommon entities. A generative model that uses retrieved information can produce high-fidelity and faithful images, even for rare or unseen entities.

Given a text prompt, Re-Imagen accesses an external multi-modal knowledge base to retrieve relevant (image, text) pairs and uses them as references to generate the image. Re-Imagen is augmented with the knowledge of high-level semantics and low-level visual details of the mentioned entities, and thus improves its accuracy in generating the entities’ visual appearances.

Re-Imagen-1

A cross-attention over the neighbors is added after the UNet encoder; the neighbors are likewise encoded by the UNet encoder (with $t$ set to 0) to serve as keys and values, and all parameters are trained jointly.

Re-Imagen-2

At sampling time, a user-provided reference image can serve as the neighbor, achieving an effect similar to Textual Inversion.

 

CAD

Don’t Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion

  1. Score the text-to-image dataset's text-image similarity with CLIP and, after processing, convert it into a coherence score in $[0,1]$.

  2. Train the diffusion model with the coherence score as an extra condition.

  3. Generate with CFG on the coherence score: $\epsilon_\theta(x_t, y, 1, t) + \omega\left(\epsilon_\theta(x_t, y, 1, t) - \epsilon_\theta(x_t, y, 0, t)\right)$.

 

Latent Transparency

Transparent Image Layer Diffusion using Latent Transparency

Latent-Transparency

  1. Train an encoder and a decoder on transparent-image data: the encoder predicts, from the RGB image and the alpha image, an offset in the VAE latent space called the latent transparency, which is added to the RGB image's latent, i.e. a modification of the latent distribution. The goal is that the decoder can predict both the RGB and the alpha image from the modified latent, while disturbing VAE reconstruction as little as possible so that StableDiffusion still runs normally. The loss has two parts: the decoder's reconstruction loss for the RGB and alpha images, and a VAE reconstruction loss constraining the encoder's predicted latent-transparency offset not to disturb the latent distribution.

  2. Fine-tune StableDiffusion on the new latent distribution.

 

LayerDiff

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

LayerDiff-1

LayerDiff-2

  1. One background layer and $K$ foreground layers; each foreground layer has an image and a mask, and the foreground layers do not overlap. The final image stitches all foreground layers together, with the background layer filling the gaps.

  2. Training data is built with InstructBLIP, SAM, and the StableDiffusion inpainting model.

 

AFA

Ensembling Diffusion Models via Adaptive Feature Aggregation

AFA

  1. Ensemble learning.

  2. AFA dynamically adjusts the contributions of multiple models at the feature level according to various states (i.e., prompts, initial noises, denoising steps, and spatial locations), thereby keeping the advantages of multiple diffusion models, while suppressing their disadvantages.

 

Diffusion-Soup

Diffusion Soup: Model Merging for Text-to-Image Diffusion Models

  1. Diffusion Soup enables training-free continual learning and unlearning with no additional memory or inference costs, since models corresponding to data shards can be added or removed by re-averaging.

  2. Diffusion Soup approximates ensembling, and involves fine-tuning n diffusion models on n data sources respectively, and then averaging the parameters.

 

Asymmetric VQGAN

Designing a Better Asymmetric VQGAN for StableDiffusion

Improves the latent space that StableDiffusion models.

Adds a conditional branch to the decoder that takes a task-specific prior, e.g. the unmasked image in inpainting.

The decoder is far larger than the encoder, improving detail reconstruction.

 

Counting Guidance

Counting Guidance for High Fidelity Text-to-Image Synthesis

Feed each step's x^0 into a pre-trained counting network; the difference between the predicted count and the desired count gives a loss whose gradient is used as guidance.

 

SAG

Improving Sample Quality of Diffusion Models Using Self-Attention Guidance

SAG

  1. Whereas classifier-free guidance computes guidance from the conditional score, self-attention guidance computes it from internal information; it is training-free and condition-free, hence general, and can be used to enhance any diffusion model.

  2. In classifier guidance, u is the target to move away from; for an unconditional model, its output can serve as c and a u can be defined by hand. Here u is the score of a noised version of the Gaussian-blurred x^0 produced at each step, called Blur Guidance: "Gaussian blur reduces the fine-scale details within the input signals and smooths them towards constant, resulting in locally indistinguishable ones." But this makes the generated images noisy — "We assume that this is because global blur introduces structural ambiguity across entire regions." — so Gaussian blur is applied only at salient positions.

  3. self-attention mask: reuse the self-attention itself. The unnormalized self-attention map has shape R^{N×(HW)×(HW)}, where N is the number of heads; apply global average pooling (GAP) over the N×(HW) dimensions, then reshape and upsample to the image size. Thresholding at the mean gives a mask corresponding to the high-frequency part of the image, and the Gaussian blur is applied only within this mask.
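The GAP-and-threshold mask can be sketched as follows (a minimal numpy sketch; the shapes and the mean threshold follow the description above, everything else is illustrative):

```python
import numpy as np

def sag_mask(attn, h, w):
    """attn: (N, HW, HW) unnormalized self-attention (N heads).
    GAP over the N x (HW) dimensions gives one saliency value per
    spatial position; thresholding at the mean yields the mask."""
    pooled = attn.mean(axis=(0, 1))               # -> (HW,)
    return (pooled > pooled.mean()).reshape(h, w)
```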

 

PAG

Self-Rectifying Diffusion Sampling with Perturbed-Attention Guidance

PAG

  1. Similar to SAG, but replaces the self-attention map with the identity matrix I to form the unconditional score for CFG sampling.

 

Online Self-Guidance

Guided Diffusion from Self-Supervised Diffusion Features

Similar to SAG, but uses the UNet features of the data itself for guidance.

Our method leverages the inherent guidance capabilities of diffusion models during training by incorporating the optimal-transport loss. In the sampling phase, we can condition the generation on either the learned prototype or by an exemplar image.

Requires full retraining.

 

Attention-Regulation

Enhancing Semantic Fidelity in Text-to-Image Synthesis: Attention Regulation in Diffusion Models

Attention-Regulation

  1. Dominance of one token's cross-attention causes the semantics of the other tokens to be lost.

  2. Takes an additional set of token indices as input (all nouns can be extracted automatically). For each selected token's cross-attention map, compute an MSE loss between its 90th-percentile response and a preset value, so that these tokens' cross-attention responses stay large (similar to A&E); also compute an MSE loss between the sum of each map's responses and a preset value, so that these tokens' responses stay balanced. A parameterized S is added inside the cross-attention computation, Softmax((QK^T+S)/√d), and the two losses are used to optimize S.

  3. We choose cross-attention layers in the last down-sampling layers and the first up-sampling layers in the U-Net for optimization.

  4. For stability, S is updated with an EMA.

 

Attention-Modulation

Towards Better Text-to-Image Generation Alignment via Attention Modulation

  1. training-free

  2. self-attention temperature control: compute attention with a smaller temperature so the softmax distribution becomes more concentrated; "high attention values between patches with strong correlations are emphasized, while low attention values between unrelated patches are suppressed. After temperature control, the patch only corresponds with patches within a smaller surrounding area, leading to the correct outlines being constructed in the final generated image. We apply the temperature operation to the early generation stage of the diffusion model in the self-attention layer."

  3. object-focused masking mechanism: parse the prompt into entity groups — objects with their adjectives, verbs, prepositions, and so on. Sum the cross-attention maps of each group's words (a group may span more than one word) to get that group's cross-attention map; then for every pixel, pick the group whose map responds most strongly there, assign the pixel to that group, and mask out (set to 0) that pixel in the cross-attention maps of every other group's words. "With this masking mechanism, for each patch, we retain semantic information for only the entity group with the highest probability, along with the global information related to the layout. This approach helps reduce occurrences of object dissolution and misalignment of attributes."

 

VP

Visual Programming for Text-to-Image Generation and Evaluation

Fine-tune an LLM on text-layout pairs so it converts text into a layout, which is fed together with the text into GLIGEN for precise controllable generation.

 

MuLan

MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion

a training-free Multimodal-LLM agent that can progressively generate multi-object with planning and feedback control, like a human painter.

 

RPG

Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs

Given a long caption, use ChatGPT to split it into n sub-captions, recaption each of them, and assign each sub-caption a layout region in the image.

At generation time, the n sub-captions form a batch fed through the SDXL UNet; at every cross-attention, resize its output latent to the size of each sub-caption's layout, re-concatenate the n resized latents spatially back to the original resolution, and pass the result onward.

To keep the concatenation boundaries consistent, interpolate between the latent from the original caption's cross-attention output and the concatenated latent.

 

ERNIE-ViLG 2.0

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts

  1. A training procedure that improves image-text alignment by injecting prior knowledge.

  2. Use NLP tools to tag the key words in the text and raise the weight of their attention with image tokens in cross-attention.

  3. Use object detection to locate the regions of the objects mentioned in the text and raise the weight of the diffusion loss in those regions.

 

TokenCompose

TokenCompose: Grounding Diffusion with Token-level Supervision

TokenCompose

  1. Use SAM to extract masks for the objects named by the nouns in the prompt, then fine-tune StableDiffusion with two auxiliary cross-attention-map losses in addition to the diffusion loss.

  2. Ltoken=(1/N)Σi^N(1−Σ_{u∈Mi}CAMi,u/Σu CAMi,u), i.e., raise the fraction of the response mass that falls inside the mask. This loss does not guarantee a uniform response — it can be minimized even when the high responses concentrate in a sub-region of Mi — so a cross-entropy loss over the whole region is added: Lpixel=−(1/(NL))Σi Σu [Mi,u log(CAMi,u)+(1−Mi,u)log(1−CAMi,u)]
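The two auxiliary losses can be sketched as follows (a minimal numpy sketch over flattened maps; shapes and normalization are assumptions based on the formulas above):

```python
import numpy as np

def l_token(cam, mask):
    """cam, mask: (N, HW). 1 - (response mass inside mask / total mass),
    averaged over the N grounded tokens."""
    ratio = (cam * mask).sum(axis=1) / cam.sum(axis=1)
    return float((1.0 - ratio).mean())

def l_pixel(cam, mask, eps=1e-8):
    """Per-pixel binary cross-entropy between each token's map and its
    mask, pushing the response to be high across the whole mask."""
    cam = np.clip(cam, eps, 1.0 - eps)
    bce = -(mask * np.log(cam) + (1 - mask) * np.log(1 - cam))
    return float(bce.mean())
```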

 

DIFFNAT

DIFFNAT: Improving Diffusion Image Quality Using Natural Image Statistics

We propose a generic "naturalness" preserving loss function, kurtosis concentration (KC) loss, trained jointly with the diffusion loss.

 

ITI-Gen

ITI-GEN: Inclusive Text-to-Image Generation

make the pre-trained StableDiffusion to generate images which are uniformly distributed across attributes of interest.

Somewhat like the model editing of TIME and UCE, but here only the prompt is modified (prompt tuning) and the model is untouched; a reference image dataset must be provided to specify the attributes of interest.

 

MaskDiffusion

MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask

  1. StableDiffusion

  2. training-free

  3. Operates only at the 16x16 resolution.

  4. Cross-attention maps exhibit three kinds of bad cases:

cross-attention-bad-cases

  5. A region selection algorithm picks each text token's region, raises its cross-attention response, and keeps different tokens' regions as separate as possible in the map. In softmax(QK^T/√d+M), M is initialized to 0; for the i-th text token, if the j-th image token lies in its region, then Mji=Mji+w0, where w0 is a preset constant.
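The additive bias can be sketched as follows (a minimal numpy sketch; the region matrix and `w0` value are illustrative):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def region_biased_attn(q, k, region, w0=2.0):
    """q: (HW, d) image tokens, k: (L, d) text tokens.
    region: (HW, L) 0/1 -- image token j lies in text token i's region.
    Add w0 inside the region before the softmax."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d) + w0 * region
    return softmax(logits, axis=-1)
```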

 

A&E

Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models

A&E

  1. StableDiffusion

  2. training-free

  3. Operates only at the 16x16 resolution.

  4. At each generation step, optimize zt by gradient descent to encourage at least one patch in each subject token's cross-attention map to have a high response — ensuring the object's presence — then use the optimized zt to generate zt−1, and repeat.
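The per-step objective can be sketched as follows (a minimal numpy sketch; the hinge-at-1 form is an assumption in the spirit of the description):

```python
import numpy as np

def attend_excite_loss(cams):
    """cams: one (h, w) cross-attention map per subject token.
    Penalizes the subject whose strongest patch is weakest, so that
    every subject ends up with at least one high-response patch."""
    return max(1.0 - float(c.max()) for c in cams)
```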

 

D&B

Divide and Bind Your Attention for Improved Generative Semantic Nursing

  1. StableDiffusion

  2. training-free

  3. Replace the loss above with a total variation loss, so the optimization is not confined to a single patch but excites the whole region.

  4. Additionally introduce a bind loss: the prompt also contains adjectives that modify the subject tokens, and their cross-attention maps should be aligned with the corresponding nouns' maps, so the JS divergence between them (after normalization) is added as a loss.

 

ELA

Easing Concept Bleeding in Diffusion via Entity Localization and Anchoring

ELA

  1. Estimate a mask from the cross-attention maps as in DiffEdit, then self-reinforce it.

 

INITNO

INITNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization

INITNO

  1. Optimize the reparameterized distribution of the initial noise using the cross-attention and self-attention maps from the first generation step, to guarantee object presence and resolve the subject-mixing problem.

  2. SCrossAttn is the cross-attention response score, the same loss as in A&E, guaranteeing object presence.

  3. SSelfAttn is the self-attention conflict score: "existing diffusion models suffer from self-attention map overlap, leading to a failure case of subject mixing." For any two different subject tokens, locate the position of the maximum response in each one's cross-attention map and take the self-attention map at that position (each of shape H×W); over the HW positions, sum the elementwise minimum of the two maps divided by their sum. The goal is for the two self-attention maps to be high and low at complementary positions, reducing overlap.

  4. If both scores fall below their thresholds, no further optimization is needed: sample directly and generate.

  5. Ljoint combines the two scores with an additional KL divergence that keeps the reparameterized initial-noise distribution from drifting away from N(0,I).
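The conflict score of step 3 can be sketched as follows (a minimal numpy sketch; the epsilon stabilizer is an assumption):

```python
import numpy as np

def self_attn_conflict(sa_a, sa_b, eps=1e-8):
    """sa_a, sa_b: (H, W) self-attention maps taken at the two
    subjects' peak cross-attention positions. Sums min/(sum) over
    positions; small when the maps are high at disjoint positions."""
    return float((np.minimum(sa_a, sa_b) / (sa_a + sa_b + eps)).sum())
```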

 

ConceptDiffusion

Semantic Guidance Tuning for Text-To-Image Diffusion Models

ConceptDiffusion

  1. Decompose the prompt's score into a combination of concept scores: subject-concept scores are computed directly and abstract-concept scores via orthogonal projection; the combination weights come from the similarity between each concept's score and the prompt's score.

 

DreamWalk

DreamWalk: Style Space Exploration using Diffusion Guidance

  1. Decompose the prompt into sub-prompts and generate with a linear combination of the sub-prompts' CFG terms.

 

Local-Control

Local Conditional Controlling for Text-to-Image Diffusion Models

  1. StableDiffusion + ControlNet

  2. training-free

  3. If the ControlNet input contains control information for only one object — e.g., for the prompt "a dog and a cat" it contains only the cat's bounding box — then "the prompt concept that is most related to the local control condition dominates the generation process, while other prompt concepts are ignored. Consequently, the generated image cannot align with the input prompt." The dog tends to disappear.

  4. For an object with local control, roughly estimate a mask from its control signal and take as its loss the difference between the maximum of its token's cross-attention map inside the mask and the maximum outside. For an object without local control, treat the region outside the mask as its own region and the region inside as not its own, and compute the loss in the same way. Sum the losses and use the gradient as guidance.

  5. Apply the mask to ControlNet's skip-connection features, so that ControlNet only affects the features inside the mask.

 

A-STAR

A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis

  1. StableDiffusion

  2. training-free

  3. Attention overlap problem; addressed by computing the IoU between different tokens' cross-attention maps.

  4. Attention decay problem: the authors observe that the cross-attention layout is fairly clear early in StableDiffusion's generation but blurs later and is not maintained. So estimate a mask from the previous step's cross-attention map and compute the IoU between the current step's cross-attention map and this mask.

  5. The loss of (3) minus the loss of (4); its gradient serves as guidance.

 

Multi-Concept T2I-Zero

Multi-Concept T2I-Zero: Tweaking Only The Text Embeddings and Nothing Else

  1. StableDiffusion

  2. training-free; operates only on the text embeddings.

  3. Concepts that appear first in the text tend to dominate the generation and can crowd out other concepts, and their token embeddings tend to have larger norms; scaling them down mitigates this.

  4. Some concepts are generated not from their own embedding but from other embeddings. Compute the similarity between the current embedding and the others, and represent the current embedding as a weighted sum of the other embeddings.

 

FreeU

FreeU: Free Lunch in Diffusion U-Net

training-free; improves generation quality with just two coefficients.

A UNet decoder feature has two parts: the backbone feature produced by the decoder network itself, and the skip feature brought over from the encoder at the same resolution via the skip connection.

Multiplying the backbone feature by a coefficient b strengthens the UNet's denoising as b grows and raises image quality, but suppresses high-frequency information.

Experiments show the skip feature carries more high-frequency detail, so its FFT is scaled by a modest factor to restore the suppressed high frequencies.
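A minimal sketch of the two-coefficient modulation, assuming the spectral factor s is applied to a small low-frequency band of the skip feature's 2D FFT (b, s, and the band radius are illustrative values, not the paper's):

```python
import numpy as np

def freeu_modulate(backbone, skip, b=1.2, s=0.9, radius=1):
    """Scale the backbone feature by b; rescale the low-frequency band
    of the skip feature's FFT by s before the usual decoder concat."""
    f = np.fft.fftshift(np.fft.fft2(skip))
    h, w = skip.shape[-2:]
    yy, xx = np.ogrid[:h, :w]
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    f = np.where(low, s * f, f)
    skip_mod = np.fft.ifft2(np.fft.ifftshift(f)).real
    return b * backbone, skip_mod
```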

 

Emu

Emu: Enhancing Image Generation Models Using Photogenic Needles in a Haystack

As with LLMs, fine-tuning on a small, high-quality dataset can markedly improve a model's output quality without hurting its generalization.

StableDiffusion is already capable of generating high-quality images, but this capability is not effectively exploited, so quality is uneven. Emu fine-tunes StableDiffusion on 2000 manually curated, extremely high-quality images, preserving its ability to generate high-quality images without losing its generalization to text.

Early stopping (<15k iterations) avoids overfitting.

The recipe is general and also applies to pixel-level diffusion models (Imagen) and masked generative models (Muse).

 

Semantic Refinement

Fine-grained Text-to-Image Synthesis with Semantic Refinement

KNN-Diffusion (language-free training): at sampling time, select a reference image according to the semantics of the text, noise it, compute the dot product between the CLIP embeddings of the noised reference image and xt, and use its gradient as guidance.

 

T2I-Salad

Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion

generating intricate visual content from simple abstract text prompts

  1. Self-supervisedly train a discrete diffusion model over scene graphs that turns simple abstract text prompts into semantically richer scene graphs.

  2. Insert scene-graph attention into StableDiffusion and train.

 

BeautifulPrompt

BeautifulPrompt: Towards Automatic Prompt Engineering for Text-to-Image Synthesis

Collect a dataset of low-quality/high-quality prompt pairs and train a language model to generate a high-quality prompt from a low-quality one, automating prompt engineering.

 

DPO-Diff

On Discrete Prompt Optimization for Diffusion Models

  1. Our main insight is that prompt engineering can be formulated as a discrete optimization problem in the language space.

  2. To the best of our knowledge, this is the first exploratory work on automated negative prompt optimization.

 

ConceptSliders

Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models

ConceptSliders

  1. Corrects or edits a target concept.

  2. η is a constant; "sliding" means scaling the LoRA coefficient, i.e., the α in W+αΔW.

  3. The enhanced and suppressed attributes can be designed via prompt engineering, which can fix problems such as hand generation.

 

Contrastive Guidance

Contrastive Prompts Improve Disentanglement in Text-to-Image Diffusion Models

Instead of abstract negative prompts such as "low quality" or "ugly", contrastive prompts are tailored to the prompt itself — dropping some adjectives, or using antonymic prompts, e.g., replacing "with" by "without".

 

LaVi-Bridge

Bridging Different Language Models and Generative Vision Models for Text-to-Image Generation

Train an adapter to combine different pre-trained language models with pre-trained text-to-image models.

Given any pre-trained text encoder f and text-to-image generator g, for a text y let c=f(y) and generate with g(h(c)); train the MLP adapter h with g's loss while LoRA fine-tuning f and g. Only a small number of text-image pairs are needed for the adaptation.

 

Multi-LoRA

Multi-LoRA Composition for Image Generation

training-free: sum the outputs ϵθ,θi of the model with each LoRA applied during sampling (rather than summing the LoRA parameters).

 

DiffChat

DiffChat: Learning to Chat with Text-to-Image Synthesis Models for Interactive Image Creation

DiffChat can effectively make appropriate modifications and generate the target prompt, which can be leveraged to create the target image of high quality.

An LLM chats with the user and, according to the user's needs, only modifies the prompt; no image understanding is involved.

 

Syntax

StructureDiffusion

Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis

StructureDiffusion

  1. Strengthens attribute binding.

  2. Exploit syntactic structure: extract the noun phrases in the text (k in total) and encode each with the CLIP text encoder; substitute each noun phrase's embedding at its position in the original text embedding and multiply with the cross-attention map, giving k+1 outputs including the original text, which are averaged as the output.

 

SG-Adapter

SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance

SG-Adapter

  1. Adapts the CLIP text embedding so that the generated images are semantically more accurate.

  2. Use an NLP parser to extract the subject-relation-object triplets in the text (there may be several); each triplet forms a scene graph. For each scene graph, concatenate the CLIP text embeddings of the triplet's words and pass them through a linear layer to get a scene-graph embedding. The original CLIP text embedding serves as Q and the scene-graph embeddings as KV in a cross-attention that produces the refined text embedding. When computing the cross-attention map, an entry is computed only if Q's current token belongs to K's current scene graph; everything else is masked out.

 

Memorization

MemAttn

Unveiling and Mitigating Memorization in Text-to-image Diffusion Models through Cross Attention

MemAttn-1

MemAttn-2

  1. The cross-attention map is HW×L, with each row summing to 1. For non-memorized images, most of each row's cross-attention mass falls on the beginning token, and it concentrates further as t decreases; for memorized images, little mass goes to the beginning token and it instead concentrates on some specific token.

  2. Take the mean of each column of the HW×L map, form the per-column entropy terms, and sum them over columns to get the attention entropy; a higher attention entropy means a more dispersed cross-attention distribution. For non-memorized images, the attention entropy drops quickly as t decreases; for memorized images, it stays higher than for non-memorized ones.

  3. These observations can be used for memorization detection.

  4. Mitigating memorization: directly adjust the cross-attention logits, multiplying the beginning token's logits by a large factor so that most of the cross-attention mass concentrates on the beginning token.
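The mitigation of step 4 can be sketched as follows (a minimal numpy sketch; it assumes positive beginning-token logits, and the exact rescaling in the paper may differ):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def concentrate_on_begin(logits, c=4.0):
    """logits: (HW, L) cross-attention logits. Scaling the beginning
    ([SOS]) token's column pushes each row's mass back onto it,
    mimicking the non-memorized attention pattern."""
    out = logits.copy()
    out[:, 0] *= c
    return softmax(out, axis=-1)
```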

 

AMG

Towards Memorization-Free Diffusion Models

  1. Anti-Memorization Guidance: three metric functions designed to discourage generating memorized samples; their gradients serve as guidance.

 

NeMo

Finding NeMo: Localizing Neurons Responsible For Memorization in Diffusion Models

  1. We propose to localize memorization of individual data samples down to the level of neurons in DMs’ cross-attention layers.

  2. By deactivating these memorization neurons, we can avoid the replication of training data at inference time, increase the diversity in the generated outputs, and mitigate the leakage of private and copyrighted data.

 

Guidance

GuidanceInterval

Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models

  1. We propose to only apply guidance in a continuous interval of noise levels in the middle of the sampling chain and disable it elsewhere. An interval (σlo,σhi) is defined on EDM; CFG is applied only within it, with ordinary conditional sampling elsewhere.

 

DynamicGuidance

Analysis of Classifier-Free Guidance Weight Schedulers

DynamicGuidance

  1. Simple, monotonically increasing weight schedulers consistently lead to improved performances.

 

S-CFG

Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance

S-CFG-1

S-CFG-2

  1. We argue that the CFG scale should be spatially adaptive, allowing for balancing the inconsistency of semantic strengths for diverse semantic units in the image.

  2. The cross-attention map has shape HW×L, with each row summing to 1. First normalize along each column so that every column sums to 1, then assign each pixel to the token whose response is largest in that pixel's row. This normalization is important — without it the responses concentrate on the beginning token, as also observed in MemAttn.

  3. The cross-attention segmentation is coarse, so refine it with the self-attention map: multiply the self-attention map with the cross-attention map, and redo the operation in step 2 on the result.

  4. A further refinement: compute C¯=(1/R)Σ_{r=1}^{R}S^rC and redo the operation in step 2, with R=4.

  5. Split the CFG term ϵθ(zt,t,c)−ϵθ(zt,t,ϕ) into a sum over M semantic units, Σ_{i=1}^{M}γt,i mt,i[ϵθ(zt,t,c)−ϵθ(zt,t,ϕ)], where mt,i is the mask of the i-th token and γt,i a rescale coefficient. With mb the background mask estimated from the beginning token, γt,i is chosen so that the mean CFG magnitude within each token's region is rescaled to the mean over the background region.

 

WorkingMechanism

Towards Understanding the Working Mechanism of Text-to-Image Diffusion Model

  1. During the denoising process of the stable diffusion model, the overall shape and details of generated images are respectively reconstructed in the early and final stages of it.

  2. The special token [EOS] dominates the influence of text prompt in the early (overall shape reconstruction) stage of denoising process, when the information from text prompt is also conveyed. Subsequently, the model works on filling the details of generated images mainly depending on themselves.

  3. Use CFG in the early stage and only the unconditional score in the final stage, thereby halving the computation of the final stage.

 

GuideModel

Plug-and-Play Diffusion Distillation

GuideModel

  1. CFG requires two forward passes, which is costly, so a guide model is learned as an adapter — symmetric to ControlNet — that takes the scale as an input parameter and distills CFG.

 

Character

TCO

The Chosen One Consistent Characters in Text-to-Image Diffusion Models

TCO

  1. Different prompts describing a character should generate the character with the same identity.

  2. Generate, cluster, then LoRA fine-tune on the selected cluster (images of the character sharing the same identity).

 

OneActor

OneActor: Consistent Character Generation via Cluster-Conditioned Guidance

OneActor-1

OneActor-2

  1. Create consistent images of the same character.

  2. Similar to Pix2Pix-Zero: a trainable network predicts the Δc of the character word in the text embedding; all three losses are diffusion losses.

 

SFT/RL

AdaDiff

AdaDiff: Adaptive Step Selection for Fast Diffusion

Predefine a set of step counts and train a lightweight step-selection network that, given the text embedding, picks a step count from the set for generation; the generation results are scored, and the network is optimized with policy gradients.

 

ImageReward

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

Designs a system for training a text-to-image human-preference reward model; it does not train the text-to-image model or modify the sampling process, serving only as a filter that selects the better images generated by StableDiffusion, playing a role similar to CLIP.

 

LVLM-ImageReward

Improving Compositional Text-to-image Generation with Large Vision-Language Models

Use Large Vision-Language Models to assess the alignment between the generated image and the text — mainly along object number, attribute binding, spatial relationship, and aesthetic quality — then fine-tune the diffusion model in the manner of ImageReward.

 

RAHF

Rich Human Feedback for Text-to-Image Generation

RichHF-18K dataset includes two heatmaps (artifact/implausibility and misalignment), four fine-grained scores (plausibility, alignment, aesthetics, overall), and one text sequence (misaligned keywords)

 

RLHF

Aligning Text-to-Image Models using Human Feedback

  1. An algorithm for fine-tuning pre-trained StableDiffusion with human-labeled feedback to improve image-text alignment.

  2. StableDiffusion is still hit-or-miss on some concepts such as count and color, so construct sentences using count and color (other poorly aligned concepts also work; count and color are just examples), generate 60-odd images per text, and have labelers give 0/1 annotations: 0 means misaligned (wrong count or wrong color), 1 means aligned.

  3. Train a reward function that predicts the alignment (outputting 0~1) from the CLIP encodings of the image and the text, trained on the labeled data with an MSE loss. A data-augmentation scheme (prompt classification) further improves the reward function: for each image-text pair labeled as aligned, perturb the count or color in the text to produce N−1 texts misaligned with the image, feed the image and the N texts to the reward function to get N predictions, and train with softmax cross-entropy classification.

  4. Fine-tune StableDiffusion on the images and sentences of the labeled dataset with a reward-weighted NLL, raising the contribution of image-sentence pairs predicted to be well aligned.

 

RLHF

Censored Sampling of Diffusion Models Using 3 Minutes of Human Feedback

Train a time-dependent reward model and, at sampling time, use the score of the time-dependent reward function as guidance.

 

FABRIC

FABRIC: Personalizing Diffusion Models with Iterative Feedback

Noise the reference image and pass it through the UNet, keeping the key-values of all self-attention layers; at generation time, concatenate these key-values after the key-values of the self-attention layers of the current pass.

Based on the feedback, high-scoring images serve as the CFG conditional and low-scoring ones as the CFG unconditional, generating with the method above.

 

DPOK

DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models

Policy-gradient fine-tuning of a pre-trained diffusion model, with z the text or another condition:

DPOK

To avoid overfitting during fine-tuning, a KL regularizer is added between the x0 generated by the fine-tuned model and the x0 generated by the original model.

 

Diffusion-DPO

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Diffusion Model Alignment Using Direct Preference Optimization

Diffusion-DPO

  1. Extends DPO to the whole diffusion chain.

 

Curriculum-DPO

Curriculum Direct Preference Optimization for Diffusion and Consistency Models

Curriculum-DPO

  1. Diffusion-DPO combined with Curriculum Learning: learn the easy pairs first (large score gaps), then the hard ones (small score gaps).

 

RLCM

RL for Consistency Models: Faster Reward Guided Text-to-Image Generation

RLCM

 

TexForce

Enhancing Diffusion Models with Text-Encoder Reinforcement Learning

Most existing T2I models use a pre-trained text encoder yet still need prompt engineering at generation time, which suggests the text encoder is suboptimal; T2I misalignment can thus be attributed to the suboptimal text encoder. TexForce therefore fine-tunes the text encoder with RLHF + LoRA to make the text carry more visual features. It can be combined with DPOK-style UNet fine-tuning for even better results, and can be used to fix hands.

 

TextCraftor

TextCraftor: Your Text Encoder Can be Image Quality Controller

TextCraftor

  1. Similar to TexForce.

 

HPS

Human Preference Score: Better Aligning Text-to-Image Models with Human Preference

Train a human preference classifier and LoRA fine-tune StableDiffusion.

 

PAHI

Model-Agnostic Human Preference Inversion in Diffusion Models

  1. Use a distilled one-step generator and a scoring model; optimize the mean and variance of the initial-noise Gaussian via the reparameterization trick.

  2. For a given prompt, draw one noise from the standard Gaussian and one from the reparameterized Gaussian, generate one sample from each, score both with the scoring model, and optimize the mean and variance with a cross-entropy objective so that the latter scores higher.

  3. The optimization can be done for a single prompt or over a prompt dataset.

 

SynArtifact

SynArtifact: Classifying and Alleviating Artifacts in Synthetic Images via Vision-Language Model

SynArtifact

 

DRaFT

Directly Fine-Tuning Diffusion Models on Differentiable Rewards

LoRA + gradient checkpointing; fine-tune StableDiffusion with a reward function.

 

AlignProp

Aligning Text-to-Image Diffusion Models with Reward Backpropagation

Gradient checkpointing; fine-tune StableDiffusion with a reward function.

 

DRTune

Deep Reward Supervisions for Tuning Text-to-Image Diffusion Models

DRTune

  1. Different sampling methods can all be written as xt−1=atxt+btϵθ(xt,t)+ctϵ.

  2. When using DRaFT and AlignProp, gradient checkpointing is no longer needed: simply block the gradient of ϵθ(xt,t) with respect to xt, i.e., xt−1=atxt+btϵθ(sg(xt),t)+ctϵ, so that ∂xt−1/∂xt=at and none of the UNet's intermediate activations need to be stored.

 

DDPO

Training Diffusion Models with Reinforcement Learning

Policy-gradient fine-tuning of a pre-trained diffusion model; the formula is the same as DPOK's.

DDPO

This gradient is the sum of the gradients of each sampling step of the diffusion model, not a single chain.

 

BoigSD

Behavior Optimized Image Generation

Uses DDPO to align SD with a proposed BoigLLM-defined reward.

 

D3PO

Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model

 

Diffusion-KTO

Aligning Diffusion Models by Optimizing Human Utility

Diffusion-KTO

 

PRDP

PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models

 

Parrot

Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation

Predefine K metrics; at training time, randomly pick one metric, prepend its reward-specific identifier to the prompt, and train with DDPO.

At generation time, concatenate the K reward-specific identifiers and prepend them to the prompt.

 

VersaT2I

VersaT2I: Improving Text-to-Image Models with Versatile Reward

  1. ChatGPT generates N prompts; StableDiffusion generates K images per prompt; a reward model scores the images and the top-scoring image per prompt is kept, giving N prompt-image pairs for LoRA fine-tuning StableDiffusion, with LoRA applied to all cross-attention layers.

  2. Train one LoRA ΔWi per reward-model aspect, then use all the selected data to train a LoRA router: o=W0x+Σ_{i=1}^{L}Softmax(xWg)i ΔWi x
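The router can be sketched as follows (a minimal numpy sketch of the gating formula; shapes are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def lora_route(x, W0, deltas, Wg):
    """o = W0 x + sum_i softmax(x Wg)_i * (dW_i x): the gate picks a
    soft mixture over the L aspect-specific LoRA deltas."""
    gate = softmax(x @ Wg)                 # (L,)
    out = W0 @ x
    for g, dW in zip(gate, deltas):
        out = out + g * (dW @ x)
    return out
```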

 

CoMat

CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching

CoMat

  1. DPOK; like a fine-tuned version of TokenCompose.

  2. Lcap is the sum of the AR teacher-forcing next-token-prediction losses.

  3. Ltoken forces the model to activate the attention of the object tokens only inside the region — i.e., concentrate the response in the object token's column of the cross-attention map into the mask's rows.

  4. Lpixel forces every pixel in the region to attend only to the object tokens by a binary cross-entropy loss — i.e., concentrate the response in the mask's rows of the cross-attention map onto the object token's column.

 

DPT

Discriminative Probing and Tuning for Text-to-Image Generation

DPT

  1. Extract StableDiffusion features and feed them into a Q-Former, trained with global matching (CLIP loss) and local grounding (classification, bounding box) tasks.

  2. After that training, add LoRA to all of StableDiffusion's cross-attention layers and train the Q-Former and the LoRA jointly with the same losses.

  3. At generation time, perform self-correction: use the gradient of the global-matching CLIP loss as guidance.

 

SELMA

SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data

SELMA

 

Language

BDM

Bridge Diffusion Model: bridge non-English language-native text-to-image diffusion model with English communities

Uses ControlNet to add Chinese control to StableDiffusion.

The ControlNet input becomes xt, but its cross-attention layers use Chinese CLIP to inject the Chinese text; during training, StableDiffusion's text input is set to the empty string, since otherwise it would impede the modeling of the Chinese.

 

Taiyi-Diffusion-XL

Taiyi-Diffusion-XL: Advancing Bilingual Text-to-Image Generation with Large Vision-Language Model Support

 

PEA-Diffusion

PEA-Diffusion: Parameter-Efficient Adapter with Knowledge Distillation in non-English Text-to-Image Generation

PEA-Diffusion

Replaces the KD loss with an L2 loss between features.

 

AltDiffusion

AltDiffusion: A Multilingual Text-to-Image Diffusion Model

AltDiffusion

 

LLMDiffusion

An Empirical Study and Analysis of Text-to-Image Generation Using Large Language Model-Powered Textual Representation

LLMDiffusion

  1. The pre-trained CLIP model can merely encode English with a maximum token length of 77. Moreover, the model capacity of the text encoder from CLIP is relatively limited compared to Large Language Models (LLMs), which offer multilingual input, accommodate longer context, and achieve superior text representation. In this paper, we investigate LLMs as the text encoder to improve the language understanding in text-to-image generation.

  2. In stage 1, when the text length exceeds 77, split the text into multiple sentences, encode each with CLIP, and concatenate the results.

 

Resolution

Mixture of Diffusers

Mixture of Diffusers for Scene Composition and High Resolution Image Generation

Mixture

Generate by region, with one prompt per region.

The keys to harmonization:

  1. fuse at every step

  2. adjacent regions must overlap; the overlap is a weighted sum, and harmonization propagates through the overlap

 

MultiDiffusion

MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation

Similar to Mixture of Diffusers, except that MultiDiffusion pads the denoised result zt−1, while Mixture of Diffusers pads the predicted noise.
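The per-step fusion shared by this family of methods can be sketched as follows (a minimal numpy sketch; single-channel patches for brevity):

```python
import numpy as np

def fuse_denoised_patches(patches, coords, H, W):
    """patches: per-region denoised results z_{t-1}; coords: their
    top-left offsets on the full canvas. Overlapping pixels are
    averaged, which is what ties adjacent regions together."""
    acc = np.zeros((H, W))
    cnt = np.zeros((H, W))
    for p, (y, x) in zip(patches, coords):
        h, w = p.shape
        acc[y:y + h, x:x + w] += p
        cnt[y:y + h, x:x + w] += 1.0
    return acc / np.maximum(cnt, 1.0)
```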

 

StreamMultiDiffusion

StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control

Speeds up MultiDiffusion.

 

SyncDiffusion

SyncDiffusion: Coherent Montage via Synchronized Joint Diffusions

MultiDiffusion only keeps adjacent sub-regions stylistically consistent; it cannot guarantee a globally consistent style.

Pick one sub-region as the anchor; before each denoising step, compute the LPIPS score between every sub-region's xt and the anchor sub-region's xt (computed via the x^0 estimated from xt), use its gradient as guidance to update all sub-regions' xt, and then run the MultiDiffusion procedure.

 

SyncTweedies

SyncTweedies: A General Generative Framework Based on Synchronized Diffusions

SyncTweedies

Z is the canonical space (e.g., a panorama) and {Wi}i^n are the instance spaces (e.g., normal-sized images); the diffusion models are all trained on the Wi. The n Wi may be identical or different; f is the projection (e.g., cropping to the normal size), g the unprojection (e.g., padding to the panorama size), ϕ the Tweedie-formula prediction of x^0, and ψ the formula for the posterior mean.

MultiDiffusion and SyncDiffusion correspond to case 3.

This paper finds that case 2 works best.

 

SSL-guided

Learned representation-guided diffusion models for large-image generation

Train the diffusion model on an image patch together with the features of that patch extracted by a pre-trained SSL model.

At generation time, first generate the features, then generate patch by patch with overlap in the manner of MultiDiffusion.

 

CutDiffusion

CutDiffusion: A Simple, Fast, Cheap, and Strong Diffusion Extrapolation Method

CutDiffusion

  1. Two phases, [T,T′] and [T′,0]; each patch has the original diffusion model's generation size.

  2. The first phase [T,T′] builds the overall structure: sample with non-overlapping patches, then randomly permute the pixels occupying the same position across patches (e.g., if the first position of 4 patches holds 1, 5, 9, 13, a random permutation 9, 5, 1, 13 puts 9 at the first patch's first position, 5 at the second patch's, and so on), "enabling pixels to contribute to the denoising of other images and promoting similarity in content generation across patches."

  3. The second phase [T′,0] refines: as in MultiDiffusion, sample with overlapping patches and average the overlapping areas.

 

Variable-Size-Diffusion

Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis

According to the attention-entropy theory, the model can generate images of different sizes simply by modifying the attention scaling factor.

 

ScaleCrafter

ScaleCrafter: Tuning-free Higher-Resolution Visual Generation with Diffusion Models

  1. Pre-trained StableDiffusion cannot directly generate higher-resolution images because the receptive field of its convolution kernels is limited.

  2. we propose a simple yet effective re-dilation that can dynamically adjust the convolutional perception field during inference

 

FouriScale

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

FouriScale

  1. Similar to ScaleCrafter, attributes the problem to the convolution kernels: when generating higher-resolution images, low-pass filter the feature maps and dilate the convolution kernels.

 

Any-Size-Diffusion

Any-Size-Diffusion: Toward Efficient Text-Driven Synthesis for Any-Size HD Images

Keep the autoencoder fixed and LoRA fine-tune StableDiffusion. Predefine a set of aspect ratios, each corresponding to an image size; during training, find the predefined ratio closest to each image's aspect ratio and resize the image to that ratio's size. The model can then generate from noise of any predefined aspect ratio.

Use StableSR's tiled sampling — similar to MultiDiffusion — for super-resolution to arbitrary resolutions.

 

Self-Cascade

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Self-Cascade

Define an increasing sequence of resolutions; only a diffusion model trained at the lowest resolution is needed.

During training, noise an x0 at any chosen resolution to obtain xt; when denoising xt, feed the previous (lower) resolution's x0 through the network to get some UNet features, pass them through an upsampler with few parameters, and add its outputs onto xt's corresponding features — effectively conditioning on the previous resolution's x0.

At sampling time, first sample at the lowest resolution, noise the sample to some intermediate step, upsample to the next higher resolution, and continue sampling; repeat until the highest resolution.

 

DiffCollage

DiffCollage: Parallel Generation of Large Content with Diffusion Models

Consider a composite image [x1,x2,x3], where [x1,x2] is an original image and [x3] is outpainted conditioned on [x2].

p(x1,x2,x3)=p(x1,x2)p(x3|x2)=p(x1,x2)p(x2,x3)/p(x2)

The corresponding score is logp(x1,x2,x3)=logp(x1,x2)+logp(x2,x3)−logp(x2)

Two models can be trained separately — one fitting the original images [x1,x2] and one fitting the partial image [x2] — and then sampled jointly.
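The score assembly can be sketched on per-pixel score arrays as follows (a minimal numpy sketch; segment lengths n1, n2 are assumptions for illustration):

```python
import numpy as np

def collage_score(s12, s23, s2, n1, n2):
    """Assemble the score of [x1,x2,x3] from s12 over [x1,x2],
    s23 over [x2,x3], and s2 over the shared [x2]: the overlap is
    counted once, mirroring
    log p = log p(x1,x2) + log p(x2,x3) - log p(x2)."""
    n3 = len(s23) - n2
    out = np.concatenate([s12, np.zeros(n3)])
    out[n1:n1 + n2] += s23[:n2] - s2
    out[n1 + n2:] += s23[n2:]
    return out
```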

 

ElasticDiffusion

ElasticDiffusion: Training-free Arbitrary Size Image Generation

  1. The diffusion model is trained at H×W; sample images of arbitrary resolution H¯×W¯ training-free.

  2. The CFG sampling formula ϵθ(xt)+(1+ω)(ϵθ(xt,c)−ϵθ(xt)) can be seen as two parts: the unconditional score ϵθ(xt) and the class direction score ϵθ(xt,c)−ϵθ(xt). "we use two key insights. First, the class direction score primarily dictates the image’s overall composition, while the unconditional score enhances detail at the pixel level in a more local manner. Second, the unconditional score requires a pixel-specific precision, contributing to the image’s fine-grained details, while class direction score affects pixels collectively, defining the image’s overall composition." Hence the unconditional score must be computed precisely, while the class direction score only needs a rough estimate.

  3. For the unconditional score, previous methods sample overlapping patches and average the overlaps (each patch H×W, all patches covering H¯×W¯), as in MultiDiffusion. ElasticDiffusion instead splits H¯×W¯ into non-overlapping patches of h×w with h<H, w<W; at sampling time, each patch is padded with its surrounding pixels as context up to H×W and fed to ϵθ, only the prediction for the patch itself is kept, and all the patch predictions tiled together give the unconditional score. Compared with overlapped sampling, this greatly reduces the number of network calls.

  4. For the class direction score, downsample x¯t∈R^{H¯×W¯×3} to xt∈R^{N×M×3}, where H¯/W¯=N/M and N×M is as close as possible to H×W; pad xt to H×W with a random solid-color background, feed it to ϵθ, strip the padded part from the prediction, and upsample back to H¯×W¯. The network is run once with and once without the condition, and the difference gives the class direction score. To keep the statistics of the latent signal unchanged, all down/upsampling here uses nearest-neighbors mode.

  5. Because the nearest-neighbors resampling makes many local pixels of H¯×W¯ share the same class direction score, the generated images become overly smooth. Borrowing the resample technique, after the first prediction of the class direction score, the prediction is repeated R more times, each time randomly replacing the results at 20% of the positions in the class direction score with the new prediction.

  6. Reduced-Resolution Guidance: estimate an x^0u from the unconditional score; in step 4, the two predictions (with and without the condition) let CFG estimate an x^0c∈R^{N×M×C}, which is upsampled to H¯×W¯ and subtracted from x^0u; the gradient serves as an additional guidance with strength s. "Since the overall image structure is determined in the early diffusion steps, we start with s=200 and linearly decrease this weight until 60% of the diffusion steps are completed."

 

MagicScroll

MagicScroll: Nontypical Aspect-Ratio Image Generation for Visual Storytelling via Multi-Layered Semantic-Aware Denoising

MultiDiffusion

 

FiT

FiT: Flexible Vision Transformer for Diffusion Model

 

BeyondScene

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

  1. Targets high-resolution human-centric generation given a pose and a prompt.

 

Personalization

direct: reuse an existing token and optimize or adapt its token embedding

transform: a network converts visual information into a token embedding or a residual

attach: appended after an existing prompt

no pseudo word: no existing or newly added token is needed

 

Subject

TI (direct)

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

  1. StableDiffusion

  2. S∗

  3. The diffusion loss optimizes only the token embedding (the embedding before the text encoder).

 

CustomDiffusion (direct)

Multi-Concept Customization of Text-to-Image Diffusion

  1. StableDiffusion

  2. [V] class

  3. Trains the token embedding together with the cross-attention KV projection matrices. Like DreamBooth, a regularization dataset is constructed to counter the language-drift problem — effectively a StableDiffusion version of DreamBooth that fine-tunes only the cross-attention KV projection matrices.

  4. Can be trained on several sets of reference images at once; at generation time, several pseudo words can be composed into one prompt.

 

DreamBooth (direct)

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

  1. Imagen

  2. [V] class

  3. A token embedding alone has limited expressive power and performs poorly, so the token embedding is optimized while the whole model (including the text encoder) is also fine-tuned.

  4. Fine-tuning suffers from overfitting and language drift, hence the Class-specific Prior Preservation Loss: as in the replay methods of continual learning, generated samples are mixed with the new samples into the training set to prevent overfitting.

  5. An improved version fine-tunes the diffusion model with LoRA.

LoRA

LoCon

 

ViCo (direct)

ViCo: Plug-and-play Visual Condition for Personalized Text-to-image Generation

ViCo

  1. StableDiffusion

  2. S

  3. Introduces the reference image into the network as a visual condition.

  4. Both z_t and the reference image go through text cross-attention; an image cross-attention is then added, with the output of z_t's text cross-attention as Q and the output of the reference image's text cross-attention as K/V.

  5. A mask is estimated from the reference image's text cross-attention map and used to filter the K/V, keeping only entries inside the mask (the K/V sequence becomes shorter), so Q only attends to K/V covering the object.
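Point 5 can be sketched in numpy as follows; `keep` stands for the boolean mask thresholded from the reference image's text cross-attention map:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_image_cross_attention(q, kv, keep):
    """ViCo-style image cross-attention (sketch): only reference tokens
    inside the object mask survive as K/V, so the K/V sequence shrinks and
    Q attends exclusively to subject tokens."""
    kv = kv[keep]
    attn = softmax(q @ kv.T / np.sqrt(q.shape[-1]))
    return attn @ kv

rng = np.random.default_rng(0)
q, kv = rng.standard_normal((4, 8)), rng.standard_normal((10, 8))
keep = np.zeros(10, dtype=bool)
keep[2] = True                         # a single reference token kept
out = masked_image_cross_attention(q, kv, keep)
```

With one surviving token the attention weights collapse to 1, so every output row equals that token's value.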

 

HyperDreamBooth (no pseudo word)

HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models

  1. On the CelebA-HQ dataset, train a HyperNetwork that predicts the LoRA parameters of all attention layers of StableDiffusion to reconstruct the image. StableDiffusion always receives the prompt "a [V] face", where "[V]" is a rare word; its token embedding is not optimized, because the authors find that with the LoRA parameters alone, "[V]" can already be used freely in new prompts.

  2. At test time, the HyperNetwork's predicted LoRA parameters serve as initialization, followed by LoRA fine-tuning, which is 25x faster than DreamBooth.

  3. The HyperNetwork architecture resembles a Q-Former and generates the final parameters iteratively from a zero initialization; the predicted LoRA parameters are applied to StableDiffusion and the diffusion loss optimizes the HyperNetwork.

HyperDreamBooth

 

XTI (direct)

P+: Extended Textual Conditioning in Text-to-Image Generation

  1. StableDiffusion

  2. Defines the P+ space: the set of text embeddings fed to the cross-attention of each UNet layer. Different layers respond to text embeddings with different effects.

  3. TI in P+ space: for a concept, a separate token embedding is optimized per cross-attention layer, i.e. 16 different token embeddings in StableDiffusion.

  4. Only token embeddings are optimized; model parameters are frozen.

  5. Feeding different layers the token embeddings of different concepts obtained by TI also achieves semantic composition.

 

ProSpect (direct)

ProSpect: Prompt Spectrum for Attribute-Aware Personalization of Diffusion Models

  1. StableDiffusion

  2. Splits the 1000 timesteps into 10 stages, each trained with its own token embedding (the paper's prompt spectrum).

  3. Using different references' token embeddings at different stages enables transfer-based generation and editing of material, style, and layout.

 

NeTI (direct)

A Neural Space-Time Representation for Text-to-Image Personalization

  1. StableDiffusion

  2. The P+ space extends along the spatial/layer axis, but a diffusion model also behaves differently at different timesteps, so P+ is further extended along time into the P* space: every timestep and every cross-attention layer gets its own token embedding. That would be far too many embeddings to train, so a neural mapper is trained instead: given the timestep t and the cross-attention layer index l, it outputs a 768-dim vector as the token embedding.

  3. During optimization, the outputs of the neural mapper are unconstrained, resulting in representations that may reside far away from the true distribution of token embeddings typically passed to the text encoder. We set the norm of the network output to be equal to the norm of the embedding of the concept's "supercategory" token. For example, when learning a cat-like concept the final output is M'(t, l) = ||v_cat|| · M(t, l) / ||M(t, l)||, where v_cat is the word embedding of "cat".

  4. The neural mapper is an MLP whose hidden latent h before the last hidden layer has dimension d_h; we find that d_h greatly affects the trade-off between reconstruction quality and editability of the inverted concept. Theoretically, one could train multiple neural mappers with different representation sizes and pick the one best balancing reconstruction and editability per concept, but doing so is cumbersome and impractical at scale. So at every training step a t ~ U(0, d_h) is sampled and the entries h[i > t] are dropped out to 0, which encourages the network to be robust to different dimensionality sizes and to encode more information into the first set of output vectors, which have a lower truncation frequency. The dropout can also be controlled at sampling time: with heavy dropout the generated concept is coarser but more editable.

  5. Inverting a concept directly into the UNet's input space, without going through the text encoder, could potentially lead to much quicker convergence and more accurate reconstructions. The neural mapper therefore outputs two vectors: one is a token embedding sent through the CLIP text encoder together with the other words, and the other skips the CLIP text encoder and is added directly onto the text embedding at the encoder's output, with the same normalization as above to prevent overfitting. This extra vector is only added to the values of the UNet's cross-attention layers; the keys use the text-encoder output without the extra vector, following the key-locking principle.
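Points 3-4 amount to a norm rescaling plus nested dropout. A minimal sketch (simplified: the dropout is applied to the output vector here, whereas the paper truncates the hidden latent h):

```python
import numpy as np

def neti_postprocess(raw_out, supercat_emb, t_trunc):
    """Rescale the mapper output to the norm of the supercategory token
    (e.g. "cat") and zero every dimension with index > t_trunc (nested
    dropout; t_trunc ~ U(0, d_h) during training)."""
    v = raw_out.copy()
    v[t_trunc + 1:] = 0.0              # nested dropout of the tail dims
    return v / (np.linalg.norm(v) + 1e-8) * np.linalg.norm(supercat_emb)

v_cat = np.full(8, 2.0)                # stand-in supercategory embedding
out = neti_postprocess(np.arange(1.0, 9.0), v_cat, t_trunc=4)
```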

 

PerFusion (direct)

Key-Locked Rank One Editing for Text-to-Image Personalization

  1. StableDiffusion

  2. The two main goals of personalization are avoiding overfitting and preserving identity, but the two naturally trade off; to improve both simultaneously, our key insight is that models need to disentangle what is generated from where it is generated.

  3. In cross-attention the keys determine where something is generated and the values determine what is generated, so only W_V is trained, while W_K is edited with ROME so that W_K applied to the pseudo embedding matches W_K applied to the supercategory embedding (teddy).

  4. A natural solution is then to edit the weights of the cross-attention layers, W_V and W_K, using ROME. Specifically, when given a target input i_Hugsy we enforce the K activation to emit a specific target output o^K_Hugsy = K_teddy. Similarly, given a target input i_Hugsy, we enforce the V activation to emit a learned output o^V_Hugsy = V_teddy.

PerFusion

 

CrossInitialization (direct)

Cross Initialization for Personalized Text-to-Image Generation

  1. TI initializes the token embedding v with the supercategory, but after optimization the learned v's scale has grown by tens to hundreds of times; such a large change shows this initialization is poor, causing slow optimization, overfitting, and weak editability.

  2. Comparing a token embedding v with E(v), the text encoder's output at the corresponding position, shows that the text encoder keeps changing the embedding's magnitude and direction, and that after TI optimization v and E(v) become similar in both.

  3. For a token embedding v, replacing it with E(v) before feeding the text encoder yields similar generations, suggesting something like a fixed point.

CrossInitialization

  1. This suggests TI's optimization target is v = E(v), so v can be initialized with E(v*), the text-encoder output at the position of the supercategory token embedding v*, together with a regularizer keeping v from drifting far from v*.
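The resulting objective can be sketched as below; `reg` is a hypothetical regularization weight and `diffusion_loss` stands in for the usual denoising loss:

```python
import numpy as np

def cross_init_objective(v, v_init, diffusion_loss, reg=1e-4):
    """Cross Initialization (sketch): v starts at E(v*), the text-encoder
    output at the supercategory position, and is regularized to stay near
    that initialization."""
    return diffusion_loss(v) + reg * float(np.sum((v - v_init) ** 2))

v_init = np.zeros(4)
val = cross_init_objective(v_init + 1.0, v_init, lambda v: 0.0)
```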

 

DP (direct)

A Data Perspective on Enhanced Identity Preservation for Diffusion Personalization

  1. DreamBooth.

  2. Constructs a better regularization dataset.

Data-Oriented

 

UFC (direct)

User-Friendly Customized Generation with Multi-Modal Prompts

UFC

  1. Uses BLIP and ChatGPT to construct better regularization prompts, customized prompts, and generation prompts.

 

SID (direct)

Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization

SID-1

SID-2

  1. DreamBooth.

  2. Like DP, trains with descriptions that are as detailed as possible, reducing the bias toward irrelevant content absorbed by the pseudo word.

  3. The authors summarize several common kinds of bias and use a VLM to generate sentences describing them.

 

CLiC (direct)

CLiC: Concept Learning in Context

CLiC

  1. StableDiffusion

  2. An RoI version of CustomDiffusion: run TI on the object in the RoI while also optimizing the cross-attention K/V projection matrices.

  3. l_con is the diffusion loss with the RoI region up-weighted, so the model learns the RoI's pattern in context; l_attn extracts the token's cross-attention map, boosting responses inside the RoI and suppressing those outside; l_RoI runs TI with a sentence containing only the token and an image containing only the RoI region.

  4. Editing is done with SDEdit + Blended.

 

EM-Optimization (direct)

Visual Concept-driven Image Generation with Text-to-Image Diffusion Model

Initialize the token by encoding the super-class name with the CLIP text encoder, then optimize with an EM algorithm:

E-step: sample 50 random timesteps, noise the reference image, feed it together with the pseudo-word prompt into StableDiffusion, extract the pseudo word's cross-attention maps, average them, and threshold the result into a mask.

M-step: with this mask, optimize the pseudo-word embedding with a masked diffusion loss + masked cross-attention loss.

 

PALP (direct)

PALP: Prompt Aligned Personalization of Text-to-Image Models

  1. A LoRA version of DreamBooth.

  2. At test-time fine-tuning, besides the reference images, the prompt to be used at generation (e.g. "a sketch of [V]") must also be provided, i.e. the model is fine-tuned before every generation.

  3. One problem of personalization is overfitting: an overfitted model can predict the subject's shape and features from pure noise in a single step. Our key idea is to encourage the model's denoising prediction towards the target prompt.

PALP-1

  1. Besides the diffusion loss, an SDS loss pulls the prediction under the personalized prompt towards the prediction under the clean prompt.

PALP-2

 

CustomSketching (direct)

CustomSketching: Sketch Concept Extraction for Sketch-based Image Synthesis and Editing

CustomSketching

 

IP-Adapter (no pseudo word, no test-time fine-tuning)

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

 

DreamTuner (no pseudo word, no test-time fine-tuning / direct)

DreamTuner: Single Image is Enough for Subject-Driven Generation

DreamTuner-1

  1. As in ViCo, injecting the reference image's features into StableDiffusion enables subject-driven generation.

  2. Subject-Encoder: to disentangle content from background, salient object detection removes the background; to disentangle content from position, a pretrained ControlNet supplies the positional information, so the encoder learns only content features.

  3. Subject-Encoder-Attention: a trainable cross-attention layer (S-E Attention) is inserted between StableDiffusion's self-attention and cross-attention to reconstruct the reference image; the reference image's self-attention is appended to the generated image's self-attention as reference.

DreamTuner-3

  1. Self-Subject-Attention: the features of the reference image extracted by the text-to-image UNet are injected into the self-attention layers, which provides refined and detailed reference because they share the resolution of the generated features. At every sampling step the reference image is directly noised, fed through the UNet, and the keys and values of its self-attention layers interact with those of the generation as above.

  2. With this method, personalized generation works even without training a pseudo-word embedding, but training a pseudo-word embedding and fine-tuning the diffusion model DreamBooth-style works better.

 

FreeTuner (no pseudo word, no test-time fine-tuning)

FreeTuner: Any Subject in Any Style with Training-free Diffusion

FreeTuner

  1. Similar to DreamTuner.

  2. Three feature-swap operations: 1) cross-attention map swap: inject the subject-related cross-attention map of the reconstruction branch into the personalized branch (e.g. the horse here). 2) self-attention map swap: inject the M_sub part of the reconstruction branch's self-attention map into the same part of the personalized branch's self-attention map, i.e. M_sub ⊙ SA_t + (1 − M_sub) ⊙ S̃A_t. 3) latent swap: inject the M_sub part of the reconstruction branch's z_t into the same part of the personalized branch's latent, i.e. M_sub ⊙ z_t + (1 − M_sub) ⊙ z̃_t. Personalized generation thus works without training a pseudo-word embedding.

  3. If a style image is also given, VGG-19 features are extracted to compute a similarity, whose gradient acts as guidance.

 

SSR-Encoder (no pseudo word, no test-time fine-tuning)

SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

SSR-Encoder

  1. As in ViCo, injecting the reference image's features into StableDiffusion enables subject-driven generation.

  2. Encode the query with the CLIP text encoder and the reference image with the CLIP image encoder into a sequence feature; the two features produce a cross-attention map. Sequence features from K different layers of the CLIP image encoder serve as V and are combined with the cross-attention map into an SSR feature of length K, which is injected into StableDiffusion via IP-Adapter; the various introduced projection matrices are trained.

  3. Self-supervised training on text-image pairs: keywords are extracted as the query, and x_t is a noised version of the reference image.

  4. This shows that the sequence features produced by the CLIP image encoder can also be used to compute similarities, not only the CLS token feature.

 

Mask-ControlNet (no pseudo word, no test-time fine-tuning)

Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt

Mask-ControlNet

 

DreamMatcher (direct)

DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization

  1. An optimization of the TI generation process, applicable to different TI methods such as DreamBooth and CustomDiffusion.

  2. Self-attention plays two roles: the attention map computed from QK controls the image structure, while V controls visual attributes such as color and texture.

  3. Images generated by TI methods have good concept structure, but concrete details such as color and texture deviate from the concept in the reference image, so this method preserves appearance by modifying V of self-attention during TI generation.

  4. Concretely, a dual branch is used: DDIM Inversion followed by reconstruction of the reference image gives a reconstructive trajectory, while a generative trajectory starts from random noise conditioned on the pseudo-word prompt. Because the concept's position in the generated image is undetermined and will not match the reference image, directly replacing V of the generative trajectory with V of the reconstructive trajectory misaligns positions; instead, some UNet-decoder features of the two trajectories establish a semantic correspondence, from which a dense displacement field is computed; V from the reconstructive trajectory is warped by this field, and the warped V replaces the corresponding V of the generative trajectory.

 

DVAR (direct)

Is This Loss Informative? Speeding Up Textual Inversion with Deterministic Objective Evaluation

  1. Proposes an early stopping criterion that speeds up TI by roughly 15x with no obvious quality drop.

 

PACGen (direct)

Generate Anything Anywhere in Any Scene

  1. Words learned by DreamBooth also work in plug-and-play models such as GLIGEN. But one drawback of DreamBooth is that it cannot disentangle object identity from positional information: when generating with a layout-conditioned model like GLIGEN, once the position changes, the object is no longer generated well.

  2. Train DreamBooth with data augmentation: by incorporating a data augmentation technique that involves aggressive random resizing and repositioning of training images, PACGen effectively disentangles object identity and spatial information in personalized image generation.

 

CI (direct)

Compositional Inversion for Stable Diffusion Models

  1. Existing methods often suffer from overfitting issues, where the dominant presence of inverted concepts leads to the absence of other desired concepts. It stems from the fact that during inversion, the irrelevant semantics in the user images are also encoded, forcing the inverted concepts to occupy locations far from the core distribution in the embedding space.

  2. Textual Inversion makes the new (pseudo-)embeddings OOD and incompatible with other concepts in the embedding space, because they lack interaction with other tokens during the post-training learning. A regularizer is added to keep the learned embedding close to the embeddings of some known, related concepts: e.g. given cat-like reference images, the learned embedding is pulled toward the embeddings of cat, pet, etc. The learned embedding is then more generic; composing it into sentences works like composing with "cat", the model can recognize it, and different learned embeddings can be combined in one sentence for multi-concept generation.

 

SuDe (direct)

SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation

SuDe

  1. Using a conventionally learned pseudo word in a sentence such as "[V] is running" fails to generate the running correctly, while the same sentence with the base class succeeds; the learned pseudo word does not inherit the base class's attributes.

  2. A regularizer is added to TI so that the learned pseudo word inherits the base class's attributes, minimizing ||x_θ(x_t, p_[V], t) − x_θ̄(x_t, p_base, t)||, where θ̄ means detached, i.e. no gradient flows through this term. It can be applied to different methods; the idea is similar to PALP's overfitting prevention.
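The regularizer in point 2 can be sketched as follows (the stop-gradient is only mimicked here by treating the base-class prediction as a constant):

```python
import numpy as np

def sude_regularizer(pred_pseudo, pred_base_detached):
    """Pull the prediction conditioned on "[V]" toward the (detached)
    prediction conditioned on the base-class prompt, so the pseudo word
    inherits base-class attributes; no gradient flows into the base branch."""
    return float(np.mean((pred_pseudo - pred_base_detached) ** 2))

reg = sude_regularizer(np.ones(4), np.zeros(4))
```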

 

ProFusion (direct)

Enhancing Detail Preservation for Customized Text-to-Image Generation A Regularization-Free Approach

  1. Obtains the token embedding with TI without any regularization term.

  2. Previous work added regularizers to prevent overfitting, but this also left information under-extracted. This paper proposes Fusion Sampling to resolve the issue.

 

DisenBooth (direct)

DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation

  1. DreamBooth

  2. Previous work such as TI and DreamBooth optimizes a single token for the reference images; DisenBooth additionally encodes an independent subject-unrelated token for each reference image, which helps learn the subject features shared by all reference images while ignoring each image's other details (such as background).

  3. Given reference images {x_i}_{i=1}^K and P = "a photo of [V] dog": f_s = E_T(P), f_i = mask ⊙ E_I(x_i) + MLP(mask ⊙ E_I(x_i)), where mask is a learnable vector that filters out subject-related information and the skip-connected MLP maps the CLIP image embedding into the text-embedding space. L1 = Σ_{i=1}^K ||ε_i − ε_θ(z_{i,t_i}, t_i, f_s + f_i)||²₂, L2 = Σ_{i=1}^K ||ε_i − ε_θ(z_{i,t_i}, t_i, f_s)||²₂, and L3 = Σ_{i=1}^K cos⟨f_s, f_i⟩ lowers the similarity between the subject-related and subject-unrelated features.

  4. Fine-tuned with LoRA.
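The three losses in point 3 can be sketched as follows; the weights `w2` and `w3` are hypothetical stand-ins for the paper's settings:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def disenbooth_loss(eps, eps_joint, eps_subj, f_s, f_i, w2=0.01, w3=0.001):
    """L1 denoises with f_s + f_i, L2 with f_s alone, and L3 (cosine
    similarity) pushes the per-image feature f_i away from the shared
    subject feature f_s."""
    L1 = np.mean((eps - eps_joint) ** 2)
    L2 = np.mean((eps - eps_subj) ** 2)
    L3 = cosine(f_s, f_i)
    return L1 + w2 * L2 + w3 * L3

loss = disenbooth_loss(np.zeros(4), np.zeros(4), np.ones(4),
                       np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0]))
```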

 

DreamArtist (direct)

DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Contrastive Prompt-Tuning

  1. 类似DisenBooth

  2. Borrowing from classifier-free guidance, two pseudo words are learned: a positive pseudo word extracts the main features (playing the role of y) and a negative pseudo word removes superfluous features (playing the role of ∅). Both words are put into "a photo of []" and fed to StableDiffusion to get two outputs; the classifier-free guidance formula combines them into ε_θ, Tweedie's formula computes ẑ₀ from z_t and ε_θ, and an MSE loss on z plus a pixel L1 loss after StableDiffusion's VAE decoder push the pseudo words to learn pixel-level detail.

  3. At generation time, the negative pseudo word's output serves as the unconditional branch u for CFG.
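Points 2-3 combine CFG with Tweedie's formula; a minimal sketch:

```python
import numpy as np

def dreamartist_eps(eps_pos, eps_neg, guidance=3.0):
    """CFG with the negative pseudo word's prediction playing the role
    of the unconditional branch."""
    return eps_neg + guidance * (eps_pos - eps_neg)

def tweedie_x0(z_t, eps, alpha_bar_t):
    """Tweedie / DDPM posterior-mean estimate of z0 from z_t and eps."""
    return (z_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

z = np.ones(4)
e = dreamartist_eps(np.ones(4), np.zeros(4), guidance=1.0)  # reduces to eps_pos
x0 = tweedie_x0(z, e, alpha_bar_t=1.0)                      # no noise left: x0 == z
```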

 

StyO (direct)

StyO: Stylize Your Face in Only One-Shot

  1. StableDiffusion

  2. One-shot face stylization: applying the style of a single target image to the source image.

  3. Construct content and style words and run TI on three datasets while also fine-tuning StableDiffusion; the target and source each have only one image.

StyO-1

Then generate with this prompt:

StyO-2

 

DreamDistribution (direct on prompt)

DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models

DreamDistribution

Somewhat like De-Diffusion, but without an explicit caption.

K prompts, each a trainable word-embedding sequence: encode them, compute mean and variance, sample with the reparameterization trick, feed the sample into the pretrained StableDiffusion, and optimize all prompts.

At generation time, simply sample.

 

SingleInsert (transform)

SingleInsert: Inserting New Concepts from a Single Image into Text-to-Image Models for Flexible Editing

  1. StableDiffusion

  2. Following Break-A-Scene, DINO or SAM segments the intended concept and a masked diffusion loss is used for training.

  3. Two-stage training: stage one does TI and trains only the image encoder; stage two fine-tunes the encoder + diffusion model.

  4. Text without the pseudo word is fed in to compute L_bg, minimizing the impact of the learned embedding on the background area.

SingleInsert

 

ELITE (transform)

ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

  1. StableDiffusion

  2. Two-stage training:

ELITE

global: CLIP serves as the feature extractor for the reference image; a global mapping network maps CLIP features from different layers into different token embeddings, where the deepest layer's predicted token embedding carries subject-related information and the shallower layers' carry subject-unrelated information; the global mapping network and the cross-attention K/V projection matrices are trained jointly.

local: remove the reference image's background, extract its CLIP features, and let a local mapping network map them to token embeddings (only the deepest layer's word is used); an extra set of cross-attention K/V projection matrices is added and trained together with the local mapping network. The cross-attention output is then the sum of the global and local cross-attention outputs; the global cross-attention still takes the token embedding produced in the global stage (deepest word only) but does not participate in training. This stage is LoRA-like: it binds more detail onto the word embedding produced by the global stage.

 

E4T (transform)

Designing an Encoder for Fast Personalization of Text-to-Image Models

  1. StableDiffusion

  2. Textual Inversion shows that the word embedding space exhibits a trade-off between reconstruction and editability: more accurate concept representations typically reside far from the real word embeddings, leading to poorer performance when used in novel prompts. StyleGAN inversion has the same problem; a two-step solution consists of approximate inversion followed by model tuning. The initial inversion can be constrained to an editable region of the latent space, at the cost of providing only an approximate match for the concept; the generator is then briefly tuned to shift the content in this region of the latent space, so that the approximate reconstruction becomes more accurate.

  3. For each domain (face, cat, dog, etc.), train an encoder E that maps the image concept I_c to an offset in word-embedding space, e_c = e_domain + s·E(I_c), which constrains the predicted embeddings to reside near the fixed word embedding of the domain's coarse descriptor e_domain; this is fed into StableDiffusion while LoRA fine-tunes the cross-attention projection matrices for reconstruction, similar to CustomDiffusion.

  4. ||E(I_c)||²₂ is used as a regularization term.

  5. Each domain is first pretrained on its own large dataset and then test-time fine-tuned on the given few images, with the same training method throughout.

 

Domain-Agnostic E4T (transform)

Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

Introduces a contrastive-based regularization technique so that the encoder can handle data from different domains.

 

Cones (direct, no test-time fine-tuning)

Cones: Concept Neurons in Diffusion Models for Customized Generation

  1. StableDiffusion

  2. training-free

  3. For each set of concepts, find the neurons in the cross-attention K/V parameters (concept neurons) whose masking lowers the DreamBooth loss (reconstruction loss + preservation loss); without any training, simply masking these neurons yields a text2img model sensitive to this concept set. Existing but rare words, e.g. "AK47", serve as pseudo words.

Cones

 

Cones2 (transform)

Cones 2: Customizable Image Synthesis with Multiple Subjects

Cones2

  1. For a subject of some class, learn a residual token embedding on that class's token. TI trains the text encoder, but this biases every word in the sentence toward the subject. A regularizer is added: ChatGPT writes sentences about the class, each sentence is encoded by both the trained and the original text encoder, and the token embeddings of the non-class words are kept close across the two encoders. The final residual token embedding is the mean of the class token embedding's differences over all sentences. (Note: a word's standalone embedding differs from its embedding within a sentence.)

  2. Each residual token embedding is thus reusable and composable with other residual token embeddings, and the cross-attention map can additionally be manipulated to place each concept.

 

HiPer (attach)

Highly Personalized Text Embedding for Image Manipulation by Stable Diffusion

  1. StableDiffusion

  2. Write a caption for the reference image without any pseudo word, and append a personalized embedding in the unused tail positions of the text embedding; only the personalized embedding is optimized during training.

personalize

 

HiFi Tuner (attach)

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

Like HiPer; the last 5 embeddings are optimized.

 

CatVersion (no pseudo word)

CatVersion: Concatenating Embeddings for Diffusion-Based Text-to-Image Personalization

  1. StableDiffusion

  2. Feed the base class word (e.g. dog) into CLIP; in CLIP's last 3 self-attention layers, concatenate a trainable residual embedding onto the keys and values, i.e. K_f + ΔK and V_f + ΔV, and train ΔK and ΔV on the reference images.

CatVersion

 

SuTI (no pseudo word, no test-time fine-tuning)

Subject-driven Text-to-Image Generation via Apprenticeship Learning

  1. Imagen

  2. For each concept, fine-tune an Imagen model on {3-10 images of the concept, the concept's text (e.g. "berry bowl")}, so the model binds the text to the given images' visual features; then build prompts with the concept text (e.g. "berry bowl floating on the river") and let the fine-tuned model generate target images from them. A large diffusion model is trained by apprenticeship learning with the target image as x₀ and {the 3-10 images of the concept, the concept's text, the prompt} as condition.

  3. With the trained large model, given 3-10 images of an unseen concept and that concept's text, any prompt built with the text generates images aligned with both the prompt and the unseen-concept images.

 

Object Encoder (no pseudo word, no test-time fine-tuning)

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

  1. Imagen

  2. Only works for personalization within domains such as faces and animals, not open domain.

  3. For each domain, train on that domain's dataset: remove each image's background, train an object encoder to extract object features, generate each image's text with a captioning model, and fine-tune Imagen on the two conditions (object features and text), with some regularization against overfitting.

  4. The trained model generates freely from a reference image's object features and a user-written prompt.

 

InstantBooth (transform, no test-time fine-tuning)

InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

  1. StableDiffusion

  2. Like Object Encoder, it only works within domains such as faces and animals, not open domain.

  3. For each domain, train on that domain's dataset, treating every image as a concept: an encoder encodes the image into two features, a concept feature and a visual feature; the concept feature replaces the embedding at the pseudo word's position in the text embedding, while the visual feature enters StableDiffusion via GLIGEN; the encoder and the GLIGEN adapters are trained, with data augmentation and background removal against overfitting. The pseudo word's token embedding itself is not optimized.

  4. At inference, build a prompt with the pseudo word; the mean of the concept features the encoder extracts from the reference images replaces the embedding at the pseudo word's position in the text embedding.

 

Instruct-Imagen (no pseudo word, no test-time fine-tuning)

Instruct-Imagen: Image Generation with Multi-modal Instruction

Instruct-Imagen

Re-Imagen + instruction tuning.

The Re-Imagen part serves to make the model condition on multi-modal input.

 

BootPIG (no pseudo word, no test-time fine-tuning)

BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models

BootPIG-1

BootPIG-2

  1. Like DreamTuner: train the network to read the reference image directly, so generation needs no test-time fine-tuning. The four projection matrices in the self-attention layers of both the Reference UNet and the Base UNet are trained.

  2. Data are synthesized for self-supervised training.

 

JeDi (no pseudo word, no test-time fine-tuning)

JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation

JeDi

  1. An image-sequence generation approach similar to M2M.

 

Multi-Subject

Break-A-Scene (direct)

Break-A-Scene: Extracting Multiple Concepts from a Single Image

  1. Extracts multiple concepts from a single image.

  2. Given an image with segmentation annotations, the pseudo words of the different objects are extracted in one pass, trained with a masked diffusion loss + masked cross-attention loss.

 

DisenDiff (direct)

Attention Calibration for Disentangled Text-to-Image Personalization

DisenDiff

  1. CustomDiffusion

  2. Extracts multiple concepts from a single image.

  3. L_bind increases the IoU between the cross-attention maps of V1* and cat, and of V2* and dog; L_s&s decreases the IoU between the cat and dog cross-attention maps.

  4. Suppress: squaring the cross-attention map (element-wise multiplication) suppresses low responses and strengthens high responses.

 

AttenCraft (direct)

AttenCraft: Attention-guided Disentanglement of Multiple Concepts for Text-to-Image Customization

AttenCraft

  1. CustomDiffusion

  2. Extracts multiple concepts from a single image.

  3. First, without pseudo words, use each concept's class word at some small timestep and extract each concept's mask with the DatasetDiffusion method; then, while learning with the CustomDiffusion method, optimize the KL divergence between each pseudo word's cross-attention map and its corresponding mask.

 

Multi-Subject Composition

Mix-of-Show (direct)

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models

Mix-of-Show

  1. Methods that optimize only the token embedding, such as TI and P+, suffice for in-domain reference images but perform poorly on out-of-domain ones.

  2. (c) For methods that optimize both the token embedding and model parameters (such as DreamBooth and CustomDiffusion), generating with the optimized token embedding but the original model parameters gives rather similar results, showing that the token embedding still captures in-domain information, while the out-of-domain information lives in the updated model parameters.

  3. (d) To shift more information into the token, the layer-wise embedding of P+ is adopted together with multi-word embeddings.

  4. After each concept is learned separately, fusing the LoRA parameters ΔW_i into one model for multi-concept generation becomes the problem. Simple weighted addition of the ΔW_i generates poorly, because the parameters interfere with each other; instead an optimization yields the fused weight: W = argmin_W Σ_{i=1}^n ||(W₀ + ΔWᵢ)Xᵢ − WXᵢ||²_F.
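The fusion objective in point 4 is a least-squares problem with a closed-form (normal-equation) solution; a numpy sketch, where `lam` is a small hypothetical ridge term for numerical stability:

```python
import numpy as np

def gradient_fusion(W0, deltas, acts, lam=1e-6):
    """Mix-of-Show gradient fusion (sketch): find W whose outputs on each
    concept's input activations X_i match the per-concept model W0 + dW_i:
        W = [sum (W0+dW_i) X_i X_i^T] [sum X_i X_i^T + lam*I]^{-1}"""
    d = W0.shape[1]
    A = sum((W0 + dW) @ X @ X.T for dW, X in zip(deltas, acts))
    B = sum(X @ X.T for X in acts) + lam * np.eye(d)
    return A @ np.linalg.inv(B)

rng = np.random.default_rng(0)
W0, dW = rng.standard_normal((5, 6)), rng.standard_normal((5, 6))
X = rng.standard_normal((6, 32))
W = gradient_fusion(W0, [dW], [X])   # single concept: W should recover W0 + dW
```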

 

LoRA-Composer (direct)

LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models

  1. Addresses how multiple separately trained pseudo words and LoRAs from Mix-of-Show can participate in one generation; training an extra LoRA to fuse all LoRAs, as Mix-of-Show does, causes concept confusion and concept vanishing.

  2. Training-free; bounding boxes for the different concepts must be provided.

  3. Split the prompt into local prompts, one concept (pseudo word) per local prompt. At every generation step, z_t is fed through the model under each LoRA's parameters for one prediction (with the matching local prompt); several losses based on the bounding boxes are computed on the self-attention and cross-attention maps, and z_t is updated by gradient descent; after a few optimization steps, z_t goes through the original StableDiffusion for denoising, and the cycle repeats. The goal is to generate each concept inside its bounding box without the concepts interfering with each other.

 

MultiBooth (direct)

MultiBooth: Towards Generating All Your Concepts in an Image from Text

MultiBooth-1

MultiBooth-2

  1. Similar in spirit to LoRA-Composer.

 

MC2 (direct)

MC2: Multi-concept Guidance for Customized Multi-concept Generation

MC2

  1. Like LoRA-Composer, addresses how multiple separately trained pseudo words and LoRAs from Mix-of-Show can participate in one generation; training an extra LoRA to fuse all LoRAs, as Mix-of-Show does, causes concept confusion and concept vanishing.

  2. Training-free, and no bounding boxes for the concepts are needed.

  3. Split the prompt into local prompts, one concept (pseudo word) per local prompt. At every generation step, z_t is fed through the model under each LoRA's parameters for one prediction (with the matching local prompt) to obtain cross-attention maps; the pairwise IoUs (n(n−1)/2 of them) are averaged as a loss and z_t is updated by gradient descent. The optimized z_t is then fed through each LoRA model to obtain z_{t−1}^{c,i}, and through StableDiffusion with the empty prompt to obtain z_{t−1}^u; the final z_{t−1} is z_{t−1}^u + Σ_{i=1}^n ωᵢ (z_{t−1}^{c,i} − z_{t−1}^u).
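The final combination step can be sketched as a multi-concept CFG:

```python
import numpy as np

def mc2_combine(z_u, z_cs, weights):
    """Combine the unconditional prediction z_u with each LoRA-conditional
    prediction z_{t-1}^{c,i}, CFG-style with per-concept weights."""
    return z_u + sum(w * (z_c - z_u) for w, z_c in zip(weights, z_cs))

z_u = np.zeros(4)
out = mc2_combine(z_u, [np.ones(4)], [1.0])   # one concept, weight 1
```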

 

OMG (direct)

OMG: Occlusion-friendly Personalized Multi-concept Generation in Diffusion Models

OMG-1

OMG-2

  1. Like LoRA-Composer, addresses how multiple separately trained pseudo words and LoRAs from Mix-of-Show can participate in one generation. It is a generic method usable with different TI methods; pseudo words and LoRAs trained by different TI methods can even generate together.

  2. Two-stage generation. In stage one, general class words replace the pseudo words and the original StableDiffusion generates, keeping all of the class words' cross-attention maps throughout the process; SAM extracts each class word's mask from the result. Stage two repeats the same generation process, but at each step every concept is predicted with its pseudo word and corresponding LoRA; the per-concept noise predictions are blended with the stage-one masks, and the stage-one cross-attention maps replace the pseudo words' cross-attention maps, achieving layout preservation.

 

FreeCustom (no pseudo word, no test-time fine-tuning)

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

FreeCustom

  1. Training-free; masks for the different concepts must be provided.

  2. MRSA: inject the K/V of self-attention in the reference path into the composition path.

 

OrthoAdaptation (direct)

Orthogonal Adaptation for Modular Customization of Diffusion Models

During LoRA fine-tuning, different concepts use mutually orthogonal B matrices that are kept fixed while only A is trained; the concepts learned this way can be generated together, since orthogonality lets the different concepts' LoRA parameters be added directly.
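A minimal sketch of the construction: draw one orthonormal frame and slice it into per-concept B blocks, so different concepts' LoRA deltas B_i A_i occupy mutually orthogonal column spaces and can simply be summed:

```python
import numpy as np

def orthogonal_lora_bases(d, r, n_concepts, seed=0):
    """One d x (r*n) orthonormal frame split into n fixed per-concept
    B_i blocks (B_i^T B_j = 0 for i != j); each concept trains only A_i."""
    g = np.random.default_rng(seed).standard_normal((d, r * n_concepts))
    q, _ = np.linalg.qr(g)                     # orthonormal columns
    return [q[:, i * r:(i + 1) * r] for i in range(n_concepts)]

Bs = orthogonal_lora_bases(d=16, r=4, n_concepts=3)
```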

 

MLoE (direct)

Mixture of LoRA Experts

  1. Addresses how multiple separately trained pseudo words and LoRAs from Mix-of-Show can participate in one generation; training an extra LoRA to fuse all LoRAs, as Mix-of-Show does, causes concept confusion and concept vanishing.

  2. MoE-like: train a gating function that computes a gating value from each LoRA's output and linearly combines the different LoRA outputs with these values, trained with the data and loss used to train the LoRAs.

 

Break-for-Make (direct)

Break-for-Make: Modular Low-Rank Adaptations for Composable Content-Style Customization

Break-for-Make

  1. DreamBooth + LoRA (applied to cross-attention); two pseudo words must be learned (from different reference images), one for content and one for style. Two baselines: share one jointly trained LoRA, or learn the LoRAs separately and add them directly at use time.

  2. Matrix factorization: W_up = [A; B] and W_down = [C, D], with A ∈ ℝ^{d×r}, B ∈ ℝ^{(m−d)×r}, C ∈ ℝ^{r×d}, D ∈ ℝ^{r×(n−d)}, so ΔW = [[AC, AD], [BC, BD]]. A and C are initialized mutually orthogonal and do not participate in training, which preserves the first d dimensions, d < min(m, n). The two pseudo words optimize B and D respectively, so AD is fully determined by D and BC by B; each learns its own concept's features, while BD learns their interaction.
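The block factorization in point 2 in numpy (shapes as in the note; the orthogonal initialization and freezing of A and C are omitted):

```python
import numpy as np

def break_for_make_delta(A, B, C, D):
    """Wup = [A; B], Wdown = [C, D], so
       dW = Wup @ Wdown = [[A@C, A@D], [B@C, B@D]]:
       content trains B, style trains D, A and C stay frozen."""
    return np.vstack([A, B]) @ np.hstack([C, D])

rng = np.random.default_rng(0)
d, r, m, n = 2, 3, 5, 7
A, B = rng.standard_normal((d, r)), rng.standard_normal((m - d, r))
C, D = rng.standard_normal((r, d)), rng.standard_normal((r, n - d))
dW = break_for_make_delta(A, B, C, D)
```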

 

Cones2 (transform)

Cones 2: Customizable Image Synthesis with Multiple Subjects

 

UMM-Diffusion (transform, no test-time fine-tuning)

Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation

  1. StableDiffusion

  2. Task: given a sentence plus images for some of its words, generate the image described by the sentence such that the objects for those words resemble the given images, which effectively enables composition. PbE-style self-supervised learning: a pretrained object detector annotates, on the LAION dataset, the location of each concrete word's object in the image, building a new dataset.

  3. The model is not fine-tuned; only an MLP that converts the given image's CLIP image embedding into a token embedding is trained, using the TI method, similar to FastComposer.

UMM

 

Subject-Diffusion (transform, no test-time fine-tuning)

Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

  1. StableDiffusion

  2. Trains an open-domain model that needs no test-time fine-tuning.

  3. Dataset: BLIP captions each image, the subjects are extracted from the caption, DINO+SAM segment each subject's bounding box, and "[subject_0] is [placeholder_0], [subject_1] is [placeholder_1]..." is appended to the caption.

  4. Training: the CLIP image encoder encodes the content inside each subject's bounding box, and the result directly replaces the embedding of the corresponding [placeholder_i]; the text encoder is retrained, so the image information is fused in before the text is modeled (experiments found this better than fusing after modeling the sentence); the cross-attention K/V projection matrices are also trained (since they transform the text features); and, as in GLIGEN, an adapter between self-attention and cross-attention injects the bounding-box information (helping to identify and separate multiple objects).

  5. Inference: given a caption, append "[subject_0] is [placeholder_0], [subject_1] is [placeholder_1]...", supply a reference image for each [placeholder_i], and optionally specify a bounding box per [placeholder_i].

 

InstantFamily (no pseudo word, no test-time fine-tuning)

InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation

 

SE-Guidance (no pseudo word, no test-time fine-tuning)

Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation

SE-Guidance

  1. Multi-subject composition built on IP-Adapter.

  2. For some subject tokens in the text prompt an image prompt is supplied; a mask is estimated by thresholding the subject token's text cross-attention map and multiplied onto the corresponding image prompt's image cross-attention map; the outputs of all image cross-attentions are combined with a weighted sum.

  3. A&E prevents object missing.

 

Concept Discovery

Conceptor

The Hidden Language of Diffusion Models

Decomposes an input text prompt into a small set of interpretable elements.

For a concept, generate 100 images from prompts; take a pool of base words and learn an MLP predicting a weight per base word, so that the linear combination of all base words reconstructs the 100 images. The goal is to learn which base words can explain the concept.

 

ConceptLab (direct)

ConceptLab: Creative Generation using Diffusion Prior Constraints

Uses DALLE-2 to generate concepts that have never existed, e.g. a pet different from every known pet.

LayoutAttn

 

DreamCreature (direct)

DreamCreature: Crafting Photorealistic Virtual Creatures from Imagination

Unsupervised clustering of the data yields different sub-concepts; each sub-concept learns one token via TI.

 

Unsupervised Concepts Discovery (direct)

Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models

  1. StableDiffusion

  2. Use the combination of the scores of different concepts (each a learnable word embedding) to reconstruct images with the diffusion loss.

UnsupervisedConceptDiscovery

K concepts are learned from N different images, or from images of the same scene, and can be composed at generation time.

 

Non-Subject Inversion

ReVersion (direct)

ReVersion: Diffusion-Based Relation Inversion from Images

TI optimizes a relation token to capture the relation shared by the reference images (e.g. shaking hands) rather than object features; sentences built with the relation token then generate images exhibiting the same relation.

Relation-Steering Contrastive Learning: the relation token should behave like a preposition, so a contrastive loss pulls the relation token toward existing prepositions and away from words of other parts of speech.

 

Lego (direct)

Lego: Learning to Disentangle and Invert Concepts Beyond Object Appearance in Text-to-Image Diffusion Models

Inverts any concept in exemplar images, such as "frozen in ice", "burnt and melted", and "closed eyes".

Lego

Contrastive learning: synonyms of the concept serve as positives and antonyms as negatives in an InfoNCE loss.

 

ADI (direct)

Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation

  1. TI optimizes an action token embedding to capture the action shared by the reference images (e.g. a handstand) rather than object features; sentences built with the action token then generate images exhibiting the same action.

  2. For each action to be learned, one token is optimized per cross-attention layer, so the representation is not limited to a single token and the semantics are richer.

  3. Avoiding action-irrelevant features: (a, c) is an anchor sample, where a denotes the specific action and c the action-agnostic context contained in the image, including human appearance and background. Another reference image gives (a, c̄); feeding both through the network yields gradients w.r.t. the token, and the per-channel differences are computed: channels with small differences are action-related, those with large differences action-unrelated, and thresholding selects the action-related channels into a mask. Separately, c is inverted with other TI methods and images are generated with other actions in the sentence, giving (ā, c), from which a second mask is obtained the same way. The intersection of the two masks is applied to the gradient computed from (a, c) to update the token.

 

ViewNeTI (direct)

Viewpoint Textual Inversion: Unleashing Novel View Synthesis with Pretrained 2D Diffusion Models

ViewNeTI

 

FSViewFusion (direct)

FSViewFusion: Few-Shots View Generation of Novel Objects

FSViewFusion

 

CustomDiffusion360

Customizing Text-to-Image Diffusion with Camera Viewpoint Control

CustomDiffusion360

  1. Given multi-view images of a new object, we create a customized text-to-image diffusion model with camera pose control.

 

Continuous 3D Words (direct)

Learning Continuous 3D Words for Text-to-Image Generation

Learns a continuous function that maps a set of attributes from some continuous domain to the token-embedding domain.

 

Face

FastComposer (transform, no test-time fine-tuning)

FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

FastComposer

 

DreamIdentity (transform, no test-time fine-tuning)

DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation

DreamIdentity

A pretrained ViT-based face-recognition model extracts the CLS-token features of layers 3, 6, 9, 12 and the last layer, concatenated together; two MLPs convert them into 2 token embeddings, trained with the diffusion loss plus an L2 regularizer on the token embeddings.

Features from multiple layers are used because the last layer's feature carries only high-level semantic information and misses detail.

 

Face2Diffusion (transform, no test-time fine-tuning)

Face2Diffusion for Fast and Editable Face Personalization

Like DreamIdentity, multi-scale features of a pretrained face model are used; additionally, a pretrained expression encoder extracts an expression feature, which is replaced with 20% probability by a learnable vector representing a neutral expression. The two features are concatenated and a mapping network converts them into token embeddings, trained with the diffusion loss.

 

PhotoMaker (transform, no test-time fine-tuning)

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

  1. StableDiffusion

  2. 训练集由多个不同的人物id组成,每个人物id包含同一个人的多个image-text pair,text中包含man或woman描述image。训练时,使用CLIP image encoder将某个id的N张images编码,得到N个image embedding,使用text encoder编码带有base class word(如man)的prompt(长度为L),提取base class word所在位置的token embedding,通过可训练的MLP,将token embedding和N个image embedding都融合一遍,得到N个融合后的 id embedding,stack起来,得到长为N的stacked id embedding,替换掉base class word所在位置的token embedding,得到长度为L1+N的text embedding,送入StableDiffusion的cross-attention,训练MLP进行重构。额外还可以LoRA fine-tune cross-attention layer。

  3. At generation time no further training is needed: given a few images of any person, write a prompt and generate.
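The stacked ID embedding described above can be sketched as follows; this is a minimal numpy illustration, where `mlp`, the toy averaging fusion, and all shapes are assumptions rather than the paper's implementation:

```python
import numpy as np

def stacked_id_embedding(text_emb, class_pos, image_embs, mlp):
    """Hypothetical sketch of PhotoMaker-style stacked ID embedding.

    text_emb:   (L, D) token embeddings of the prompt
    class_pos:  index of the base class word (e.g. "man")
    image_embs: (N, D) CLIP image embeddings of the same identity
    mlp:        fusion function (token_emb, image_emb) -> fused embedding
    Returns an (L - 1 + N, D) sequence in which the class-word token
    is replaced by N fused ID embeddings.
    """
    class_tok = text_emb[class_pos]
    fused = np.stack([mlp(class_tok, e) for e in image_embs])  # (N, D)
    return np.concatenate([text_emb[:class_pos], fused, text_emb[class_pos + 1:]])

# toy usage: the "MLP" here just averages the two embeddings
L, D, N = 5, 8, 3
text = np.random.randn(L, D)
imgs = np.random.randn(N, D)
out = stacked_id_embedding(text, 2, imgs, lambda t, e: (t + e) / 2)
assert out.shape == (L - 1 + N, D)
```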

PhotoMaker

 

PortraitBooth (transform, no test-time fine-tuning)

PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization

Similar to PhotoMaker.

PortraitBooth

 

Arc2Face (transform, no test-time fine-tuning)

Arc2Face: A Foundation Model of Human Faces

Arc2Face

 

IDAdapter (transform, no test-time fine-tuning)

IDAdapter: Learning Mixed Features for Tuning-Free Personalization of Text-to-Image Models

IDAdapter

Not shown in the figure: "the woman is sks" is appended to the prompt, and at the first embedding layer of the text encoder, the text embedding of the identifier word "sks" is replaced with the identity text embedding. The token embedding of "sks" is not optimized; it is substituted with the learned embedding.

 

InstantFamily (no pseudo word, no test-time fine-tuning)

InstantFamily: Masked Attention for Zero-shot Multi-ID Image Generation

InstantFamily

  1. Self-supervised training on multi-face images.

  2. At sampling time only aligned faces need to be provided.

 

DiffSFSR (no pseudo word, no test-time fine-tuning)

Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

DiffSFSR-1 DiffSFSR-2
  1. Given a face image and descriptions of the scene and the expression, first use StableDiffusion to generate an image from the scene description as training data, then select from a database an image with the described expression as the expression condition; the face image serves as the ID condition.

  2. Mask out the face region of the training data (keeping the scene) and concat it onto zt, so the model can focus on modeling the face region.

  3. Train the diffusion model with diffusion loss + identity loss + expression loss; no self-supervision is needed.

 

DemoCaricature (direct)

DemoCaricature: Democratising Caricature Generation with a Rough Sketch

ROME

 

Face Aging (direct)

Identity-Preserving Aging of Face Images via Latent Diffusion Models

  1. DreamBooth

  2. When computing the Class-specific Prior Preservation Loss, the face data are grouped by age, each group with a name such as child or old; prompts containing the group name together with the images form the dataset.

  3. After training, generate with a prompt of the form "photo of a person as " to perform aging and de-aging of a given face.

 

CelebBasis (direct)

Inserting Anybody in Diffusion Models via Celeb Basis

  1. StableDiffusion's text embeddings can be interpolated for generation. Based on this observation, collect names of celebrities that the CLIP text encoder can recognize and run PCA on their token embeddings to obtain a basis, which can be viewed as a representation of facial features in token embedding space.

  2. During training, given any face image, train an MLP to modulate this basis into the embedding of the pseudo word for that face, insert it into "a photo of _", and train the MLP with the Textual Inversion method.

 

CharacterFactory

CharacterFactory: Sampling Consistent Characters with GANs for Diffusion Models

CharacterFactory

  1. A character factory: not TI, needs no reference image, and directly generates random usable pseudo-word embeddings.

  2. A GAN generates fake embeddings, celebrity-name embeddings are sampled as real embeddings, and the two are trained adversarially.

  3. Lcon makes the generated pseudo-word embedding behave consistently across the text embeddings of different template prompts, by minimizing the pairwise distances between the embeddings corresponding to these words in the text embeddings of the different template prompts.

 

 

StableIdentity (direct)

StableIdentity: Inserting Anybody into Anywhere at First Sight

Inspired by Celeb Basis: collect celebrity names and obtain their word embeddings. An MLP converts the input face image into two word embeddings, which are transformed into the celeb word embedding space via AdaIN (the mean and std of the celeb word embeddings serve as shift and scale); the MLP is trained with TI. The learned word embeddings can be used in any text-based generative model, e.g. ControlNet or text2video.

 

SeFi-IDE (direct)

SeFi-IDE: Semantic-Fidelity Identity Embedding for Personalized Diffusion-Based Generation

 

LCM-Lookahead (no pseudo word, no test-time fine-tuning)

LCM-Lookahead for Encoder-based Text-to-Image Personalization

LCM-Lookahead

  1. An IP-Adapter specialized for faces.

 

Inpainting

RealFill

RealFill: Reference-Driven Generation for Authentic Image Completion

Inpainting with reference images: TI is used to extract information from the reference images to assist the inpainting.

RealFill

 

PVA

Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention

Similar to RealFill: inpainting with reference images, using TI to extract information from the references to assist the inpainting.

 

Restoration

Personalized Restoration

Personalized Restoration via Dual-Pivot Tuning

Restoration with reference images.

 

X-to-Image (more fine-grained than text-to-image)

Training-free methods come in roughly two kinds. One, nursing-style, designs a loss between xt and the given condition and uses its gradient to guide sampling; the other directly manipulates the attention maps so they satisfy the constraints of the given condition.

 

Sketch

SKG (sketch + text)

Sketch-Guided Text-to-Image Diffusion Models

SKG

  1. Introduces sketch control into a pretrained StableDiffusion.

  2. Uses a pretrained edge extractor to generate training data (self-supervised), training an MLP that predicts edges from the UNet's per-layer feature maps, similar to Label-Efficient Semantic Segmentation With Diffusion Models.

  3. At sampling time, the gradient of the MLP loss serves as classifier guidance, applied only from T to 0.5T.

  4. Uses a dynamic guidance scheme: α = s·‖zt − z(t−1)‖2 / ‖∇zt L‖2, where s is a constant and z(t−1) is the result of the original diffusion sampling step. The motivation: if a step changes the latent a lot, that step is generating more information, so the guidance should be increased; if the guidance gradient itself is large, the scale is reduced to prevent over-guidance.
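The dynamic guidance scale above can be sketched numerically; a minimal illustration, assuming plain L2 norms over the flattened latents:

```python
import numpy as np

def dynamic_guidance_scale(z_t, z_prev, grad, s=1.0):
    """Sketch of the assumed dynamic guidance scheme:
    alpha = s * ||z_t - z_prev||_2 / ||grad||_2.
    Large per-step change -> stronger guidance; large guidance
    gradient -> smaller scale, preventing over-guidance."""
    return s * np.linalg.norm(z_t - z_prev) / (np.linalg.norm(grad) + 1e-8)

z_t, z_prev = np.ones(4), np.zeros(4)   # step change has norm 2
grad = 2 * np.ones(4)                   # gradient has norm 4
alpha = dynamic_guidance_scale(z_t, z_prev, grad)
assert abs(alpha - 0.5) < 1e-6
```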

 

SketchAdapter (sketch + text)

It’s All About Your Sketch: Democratising Sketch Control in Diffusion Models

SketchAdapter

Trained only on a small sketch-image pair dataset, with no text.

A CLIP encoder encodes the sketch and the feature sequence of the last layer is taken; only a sketch adapter is trained to convert it into CLIP text embeddings, which are fed into StableDiffusion's cross-attention. Besides the diffusion loss, two extra losses are used: the z^0 predicted at each step is passed through the VAE decoder and then a sketch extractor, and the result is compared with the input sketch; and an image captioning model generates a caption for the image, which is fed to StableDiffusion, with the two StableDiffusion predictions encouraged to stay close.

 

ToddlerDiffusion

ToddlerDiffusion: Flash Interpretable Controllable Diffusion Model

Mimics how humans draw: first generate a sketch, then a palette, and finally the image. Uses the ShiftDDPMs formulation, training with the sketch or palette rather than pure noise as the starting point.

 

Layout/Segmentation

IIG (bounding box + text)

Semantic-Driven Initial Image Construction for Guided Image Synthesis in Diffusion Model

Inspired by Initial Image Editing: layout-to-image can be achieved simply by carefully constructing xT.

Concretely, with StableDiffusion, one value of the deepest cross-attention map corresponds to a 4×4 noise block in zT. Construct prompts, label the noise blocks with the cross-attention map values from the first denoising step, and build a database of noise blocks that tend to generate each object class.

At generation time, sample from the database noise blocks corresponding to each object and place them inside the specified bounding boxes.

 

NoiseCollage (bounding box + text)

NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging

NoiseCollage

Masked cross-attention: image features inside a layout attend to the object prompt, image features outside the layout attend to the global prompt, and the two results are added.
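The masked cross-attention merge can be sketched as follows; a minimal numpy illustration with assumed shapes, not the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_cross_attention(img_feat, obj_kv, glob_kv, mask):
    """Sketch of NoiseCollage-style masked cross-attention: image
    features inside the layout mask attend to the object prompt,
    features outside attend to the global prompt; results are summed.

    img_feat: (HW, D) queries; obj_kv / glob_kv: (T, D) prompt tokens;
    mask: (HW,) with 1 inside the layout and 0 outside."""
    def attend(q, kv):
        a = softmax(q @ kv.T / np.sqrt(q.shape[-1]))
        return a @ kv
    m = mask[:, None]
    return m * attend(img_feat, obj_kv) + (1 - m) * attend(img_feat, glob_kv)

HW, T, D = 6, 3, 4
out = masked_cross_attention(np.random.randn(HW, D),
                             np.random.randn(T, D),
                             np.random.randn(T, D),
                             np.array([1, 1, 0, 0, 0, 1.0]))
assert out.shape == (HW, D)
```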

 

TriggerPatch (bounding box + text)

The Crystal Ball Hypothesis in Diffusion Models: Anticipating Object Positions from Initial Noise

  1. A trigger patch is a patch in the noise space with the following properties: (1) Triggering Effect: When it presents in the initial noise, the trigger patch consistently induces object generation at its corresponding location; (2) Universality Across Prompts: The same trigger patch can trigger the generation of various objects, depending on the given prompt.

  2. We try to train a trigger patch detector, which functions similarly to an object detector but operates in the noise space. Sample random noise, generate an image, detect objects with a pretrained object detector, and use the detections as the ground truth for that noise to train the trigger patch detector.

  3. At generation time: sample random noise, detect trigger patches, and move the trigger patches to the target locations.

 

LayoutDiffuse (bounding box + text)

LayoutDiffuse: Adapting Foundational Diffusion Models for Layout-to-Image Generation

  1. Adapts pre-trained unconditional or conditional diffusion models by adding a residual layout attention layer after each attention layer, i.e. h' = LayoutAttn(h) + h.

  2. LayoutAttn(h) splits the layout into per-instance layouts (each marking a single object). Each layout serves as a mask to extract that object's region feature map from h; every feature gets the learnable embedding of the object's class label or caption added, then self-attention is applied. For the background, the learnable embedding of a null label or empty string is added to every feature of h and self-attention is applied. The results are multiplied by their masks and summed, averaging overlapping regions. As in ControlNet, the parameters are initialized to 0, so LayoutAttn(h) initially outputs 0 and does not affect the original network before training.

LayoutAttn

 

LayoutDiffusion (bounding box + text)

LayoutDiffusion Controllable Diffusion Model for Layout-to-image Generation

  1. Redesigns the UNet and retrains everything from scratch.

 

PLACE (bounding box + text)

PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis

PLACE

  1. One training sample consists of N word-layout pairs.

  2. Layout control map: convert the layout into a semantic mask so that the corresponding word's cross-attention map only responds inside the mask. Since StableDiffusion runs on an 8× downsampled latent (and deeper feature maps are smaller still), downsampling the mask the same way can erase small objects, so the mask is computed via receptive fields: for each image token of a feature map, set 1 if its receptive field in the original image intersects the object's semantic mask, else 0. The final map is an interpolation between the original cross-attention map and the masked one.

  3. Semantic Alignment Loss: encourages image tokens to interact more with the same and related semantic regions in the self-attention module, thereby further improving the layout alignment of the generated images. Cross-attention controls self-attention: for a word, its cross-attention map (HW) is used as weights to compute a weighted sum of the self-attention map (HW×HW), giving an HW vector that is optimized to be close to the cross-attention map.

  4. Layout-Free Prior Preservation Loss: the dataset is small, so to prevent overfitting, the diffusion loss is also computed on some text-to-image data, with the interpolation coefficient of the semantic-mask cross-attention map in the layout control map set to 0.

 

MIGC (bounding box + text)

MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis

MIGC

 

B2B (bounding box + text)

Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models

  1. StableDiffusion

  2. training-free

  3. box: for the cross-attention map of an object that has a bounding box in the text, the object reward is the response inside the bounding box, minus the response outside it, plus the IoU between the responses in sliding boxes placed around the bounding box and the responses inside the bounding box (to keep the response uniform).

  4. bind: the attribute reward is the negative KL divergence between the attribute's cross-attention map and the corresponding object's cross-attention map within the bounding box.

  5. The two rewards are summed and their gradient is used as guidance.

 

R&B (bounding box + text)

R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation

  1. StableDiffusion

  2. training-free

R&B

 

LAW-Diffusion (bounding box + text)

LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts

Somewhat like SpaText: each object has a region map of the same size as the image, filled with a trainable embedding of that object inside its bounding box and a trainable background embedding outside. All region maps are split into patches; the patches at the same position across the region maps form a sequence, with an agg embedding prepended, fed into a ViT (no linear projection, no positional embedding), taking the agg embedding's output. All positions are processed this way and the outputs are arranged by position into an image-sized layout embedding. A diffusion model is then trained with the layout embedding concatenated onto xt as input.

 

SALT (bounding box + text)

Spatial-Aware Latent Initialization for Controllable Image Generation

SALT

 

LayoutLLM-T2I (text -> bounding box -> image)

LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation

  1. in-context learning: randomly sample a candidate set from the training set (COCO, with prompt and bounding box annotations) and train a policy network that, given a query prompt, selects several samples from the candidate set as in-context examples. ChatGPT is given the in-context examples and the query prompt and generates bounding boxes (in text form) for the objects in the prompt. The policy network is trained with rewards such as mIoU and CLIP similarity.

  2. GLIGEN-style fine-tuning of StableDiffusion.

 

DivCon (text -> bounding box -> image)

DivCon: Divide and Conquer for Progressive Text-to-Image Generation

DivCon

 

LLM Blueprint (text -> bounding box -> image)

LLM Blueprint: Enabling Text-to-Image Generation with Complex and Detailed Prompts

LLM-Blueprint

 

RealCompo (text -> bounding box -> image)

RealCompo: Dynamic Equilibrium between Realism and Compositionality Improves Text-to-Image Diffusion Models

RealCompo

After ChatGPT generates the layout, an L2I model (e.g. GLIGEN) and a T2I model generate together: at each step, the noises predicted by the two models are combined with coefficients to form the noise for the DDIM update, and a loss defined on the DDIM result updates the coefficients for the next step, dynamically balancing realism and compositionality.

 

Reason out Your Layout (text -> bounding box -> image)

Reason out Your Layout: Evoking the Layout Master from Large Language Models for Text-to-Image Synthesis

  1. CoT reasoning: GPT-3.5 generates a layout from the prompt via in-context examples.

  2. A trainable Layout-Aware Cross-Attention is inserted between StableDiffusion's self-attention and cross-attention, applying layout-derived masks to the cross-attention map.

 

SimM (text -> bounding box -> image)

Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation

  1. StableDiffusion

  2. training-free

  3. Parse a rough layout from the positional words in the text (e.g. middle maps to a fixed-size box at the image center, left to a box covering the left third). Compare it against the cross-attention maps produced in the first generation step; a threshold test checks for layout mismatch. If they match, generate without intervention; otherwise intervene.

  4. Intervention: first generate from T to Tloc and average the cross-attention maps produced over [T, Tloc]; then slide the fixed-size boxes described above over the averaged map and determine each token's layout by thresholding. From Tloc onward, modify the cross-attention map at every step: since the assigned layout box and the detected layout box have the same size, the responses in the detected box can be copied directly into the assigned box, while responses inside the box are amplified and those outside suppressed.

 

GeoDiffusion (bounding box -> text -> image)

Integrating Geometric Control into Text-to-Image Diffusion Models for High-Quality Detection Data Generation via Text Prompt

Translate geometric conditions to text (including object coordinates etc.) and fine-tune StableDiffusion.

 

Directed Diffusion (bounding box + text)

Directed Diffusion: Direct Control of Object Placement through Attention Guidance

  1. StableDiffusion

  2. training-free

  3. During generation, upweight the bounding-box region of the cross-attention map of the corresponding text token.

 

Attention Refocusing (bounding box + text)

Grounded Text-to-Image Synthesis with Attention Refocusing

  1. StableDiffusion

  2. training-free

  3. attention refocusing

cross-attention refocusing:

Similar to Attend-and-Excite. Lfg: for each text token's cross-attention map, take the maximum cross-attention response inside the token's bounding box and sum over tokens. Lbg: take the maximum cross-attention response outside the bounding box and sum over tokens. LCAR = Lfg + Lbg.

self-attention refocusing:

LSAR: for each bounding box, for every image token inside it, take the maximum response in its self-attention map at locations outside the union of all bounding boxes covering that image token, and sum.

During sampling, compute the losses above and use the gradient of LCAR + LSAR with respect to xt as guidance.
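The cross-attention refocusing term can be sketched as follows; the exact loss form (including the 1 − max foreground term) is an assumption here, not the paper's definition:

```python
import numpy as np

def car_loss(cross_attn, box_mask):
    """Sketch of a cross-attention refocusing (CAR) loss for one
    text token. cross_attn: (H, W) cross-attention map;
    box_mask: (H, W) with 1 inside the token's bounding box.
    Encourages a high maximal response inside the box and a low
    maximal response outside it; lower loss is better."""
    fg = (cross_attn * box_mask).max()
    bg = (cross_attn * (1 - box_mask)).max()
    return (1 - fg) + bg

attn = np.zeros((4, 4)); attn[1, 1] = 0.9; attn[3, 3] = 0.1
mask = np.zeros((4, 4)); mask[:2, :2] = 1
# response concentrated inside the box -> smaller loss than the flipped mask
assert car_loss(attn, mask) < car_loss(attn, 1 - mask)
```

At sampling time, the gradient of such a loss with respect to the latent would serve as guidance.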

 

BoxDiff (bounding box/segmentation + text)

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion

  1. StableDiffusion

  2. training-free

  3. Operates only at the 16x16 resolution.

  4. Similar to Attention Refocusing: given bounding boxes for certain substrings of the text, apply to the corresponding cross-attention maps an Inner-Box Constraint (boost the responses inside the bounding box, encouraging the object to appear inside it), an Outer-Box Constraint (suppress the responses outside the bounding box, preventing the object from appearing outside it), and a Corner Constraint (encourage the object to fill the bounding box rather than appear as a tiny object inside it). The gradient of the summed losses with respect to xt serves as guidance.

 

CAC (bounding box/segmentation + text)

Localized Text-to-Image Generation for Free via Cross Attention Control

Layout-to-Image Generation with Localized Descriptions using ControlNet with Cross-Attention Control

Enhancing Image Layout Control with Loss-Guided Diffusion Models

  1. StableDiffusion

  2. training-free

  3. Cross Attention Control

  4. Besides the text, m additional instance prompt-bounding box/segmentation pairs are provided. At generation time, the text and all instance prompts are padded to the same length and fed into StableDiffusion together, yielding m+1 cross-attention maps, which are multiplied by the bounding box/segmentation masks and summed into the final cross-attention map. Unlike Attention Refocusing, no loss or gradient computation is needed.

 

SpaText (segmentation + text)

SpaText: Spatio-Textual Representation for Controllable Image Generation

  1. Each segment has its own text, enabling region-wise generation and specified spatial relations between objects.

  2. Self-supervised training: a pretrained segmentation model extracts the image's segments, CLIP extracts each segment's CLIP image embedding, and an all-zero segmentation map of the image's size (with channels matching the CLIP image embedding dimension) is initialized; each segment's CLIP image embedding is placed at its location in the map.

  3. Modify DALLE-2's Decoder: the segmentation map is concatenated directly onto xt as conditional input and the decoder is fine-tuned; no text is needed during training.

  4. At inference, DALLE-2's Prior converts each segment text's CLIP text embedding into a CLIP image embedding, these are assembled into the segmentation map, and the Decoder generates.

 

EOCNet (segmentation + text)

Enhancing Object Coherence in Layout-to-Image Synthesis

Modifies the StableDiffusion architecture and fine-tunes it.

 

FreestyleNet (segmentation + text)

Freestyle Layout-to-Image Synthesis

Changes StableDiffusion's cross-attention to rectified cross-attention: in a text token's cross-attention map, values inside the token's bounding box are kept and values outside are set to negative infinity. By forcing each text token to affect only pixels in the region specified by the layout, the spatial alignment between the generated image and the given layout is guaranteed. StableDiffusion is then fine-tuned on any layout-based data.
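Rectified cross-attention can be sketched as masking the attention logits before the softmax; a minimal numpy illustration with assumed shapes:

```python
import numpy as np

def rectified_cross_attention(scores, region_mask):
    """Sketch of rectified cross-attention: attention logits of a
    text token are kept inside its layout region and set to -inf
    outside, so after softmax the token can only affect pixels in
    its region.

    scores: (HW, T) logits; region_mask: (HW, T) with 1 where the
    image position lies inside token t's region."""
    masked = np.where(region_mask > 0, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.random.randn(4, 3)
mask = np.array([[1, 0, 0], [1, 0, 0], [0, 1, 1], [0, 1, 1.0]])
attn = rectified_cross_attention(scores, mask)
assert np.allclose(attn[mask == 0], 0)        # no attention outside regions
assert np.allclose(attn.sum(axis=-1), 1)      # rows remain normalized
```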

 

ALDM (segmentation + text)

Adversarial Supervision Makes Layout-to-Image Diffusion Models Thrive

ALDM

  1. Conventional training only feeds the layout into the model as a condition for the diffusion loss, without explicit layout supervision, so the generated result may mismatch the layout. One fix is to segment x^0 with a pretrained segmentor and compare against the given layout as a loss; however, we observe that the diffusion model then tends to learn a mean mode to meet the requirement of the segmenter, exhibiting little variation.

  2. Adversarial training is introduced. Discriminator: trained to classify every pixel of the ground truth into the N real classes and every pixel of x^0 into a fake class. The diffusion model acts as the generator: besides the diffusion loss, an adversarial loss lets the discriminator guide the training.

  3. multistep unrolling: the layout is decided early in generation, when x^0 is still poor, so the following K x^0's are generated in one go and the K adversarial losses are averaged for training.

 

DenseDiffusion (segmentation + text)

Dense Text-to-Image Generation with Attention Modulation

  1. StableDiffusion

  2. training-free

  3. Same idea as rectified cross-attention, but training-free, so it can sample directly: at cross-attention layers, we modulate the attention scores between paired image and text tokens to have higher values; at self-attention layers, the modulation is applied so that pairs of image tokens belonging to the same object exhibit higher values. Here "paired image and text tokens" means the image token's position lies inside the bounding box of the object described by the text token.

  4. softmax((QK^T + M)/√d), where M = λt·R⊙Mpos⊙(1−S) − λt·(1−R)⊙Mneg⊙(1−S) and λt = w·t/T. R is a binary matrix: for cross-attention it is 1 if the text token and image token belong to the same segment, else 0; for self-attention it is 1 if the two image tokens belong to the same segment, else 0. Mpos = max(QK^T) − QK^T and Mneg = QK^T − min(QK^T), with max and min taken along the key axis only; this keeps QK^T + M from straying too far from the original QK^T while making the adjustment proportional to the gap between the original value and the extreme. S is a ratio matrix: large differences in segment areas hurt generation quality, so for each image token the area fraction of its segment relative to the whole image is computed and used for regularization.
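The modulation above can be sketched as follows; a minimal numpy illustration of the assumed form, with toy R and S matrices:

```python
import numpy as np

def dense_modulation(scores, R, S, t, T, w=1.0):
    """Sketch of DenseDiffusion-style attention modulation:
    M = lam*R*M_pos*(1-S) - lam*(1-R)*M_neg*(1-S), lam = w*t/T,
    M_pos = max(QK^T) - QK^T, M_neg = QK^T - min(QK^T), with
    max/min along the key axis. Returns the modulated logits."""
    lam = w * t / T
    m_pos = scores.max(axis=-1, keepdims=True) - scores
    m_neg = scores - scores.min(axis=-1, keepdims=True)
    M = lam * R * m_pos * (1 - S) - lam * (1 - R) * m_neg * (1 - S)
    return scores + M

scores = np.random.randn(5, 5)
R = np.eye(5)             # toy: each token pairs only with itself
S = np.full((5, 5), 0.2)  # toy area-ratio regularizer
out = dense_modulation(scores, R, S, t=500, T=1000)
assert out.shape == scores.shape
# boosted entries never exceed the per-row max of the original logits
assert np.all(out <= scores.max(axis=-1, keepdims=True) + 1e-9)
```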

 

SCDM (segmentation)

Stochastic Conditional Diffusion Models for Robust Semantic Image Synthesis

SCDM

In real-world applications, semantic image synthesis often encounters noisy user inputs. SCDM enhances robustness by stochastically perturbing the semantic label maps through Label Diffusion, which diffuses the labels with discrete diffusion.

 

MagicMix (layout/style from image/text + text)

MagicMix: Semantic Mixing with Diffusion Models

A noisy-latents linear-combination variant of SDEdit: it weakens the original image's details, keeping only the basic structure and appearance information.

MagicMix

 

DiffFashion

DiffFashion: Reference-based Fashion Design with Structure-aware Transfer by Diffusion Models

DiffEdit+MagicMix

 

CompFuser (text)

Unlocking Spatial Comprehension in Text-to-Image Diffusion Models

For a prompt with two objects in a left/right relation, first generate one of the objects normally, then edit with an instruction like "place * on the left".

The editing model resembles InstructPix2Pix: LLM-grounded diffusion generates layouts for the two objects; the source image is generated with only one layout, the target image with both layouts; together with the instruction, InstructPix2Pix is LoRA fine-tuned.

 

GLoD (layout + text)

GLoD: Composing Global Contexts and Local Details in Image Generation

GLoD

  1. Masked SEGA.

 

Pose

StablePose

StablePose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

 

 

Scene Graph

DiffuseSG (scene graph)

Joint Generative Modeling of Scene Graphs and Images via Diffusion Models

Trains the DiffuseSG model (a Graph Transformer) to produce scene graphs, then utilizes a pretrained layout-to-image model to generate images.

 

Blob

BlobGEN (blob + text)

Compositional Text-to-Image Generation with Dense Blob Representations

  1. GLIGEN with blob tokens

 

DiffUHaul (blob + layout)

DiffUHaul: A Training-Free Method for Object Dragging in Images

DiffUHaul

 

Image

IP-Adapter (image + text)

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

IP-Adapter

  1. StableDiffusion

  2. A CLIP image encoder extracts the image embedding and a linear layer maps it to a length-4 sequence. As in StyleAdapter, a trainable image cross-attention layer is added in parallel with the text cross-attention layer; the linear layer and the image cross-attention layer are trained on the original dataset.

  3. The trained model can be used together with ControlNet and T2IAdapter without extra training.

 

Semantica (image)

  1. Uses a dataset of paired images, one as condition and the other as target, to train a U-ViT diffusion model from scratch; we do not use any text inputs and only rely on image conditioning.

  2. A pretrained CLIP or DINO encodes the image; the resulting token sequence or CLS token serves as the condition, via cross-attention for the token sequence and FiLM for the CLS token.

 

PuLID (image + text)

PuLID: Pure and Lightning ID Customization via Contrastive Alignment

PuLID

  1. IP-Adapter trains on features extracted from the source image, which to some extent causes overfitting; besides the diffusion loss, two alignment losses and an ID loss are introduced.

  2. Training constructs two contrastive paths: one path with ID uses both cross-attentions; the other path without ID uses only the text cross-attention. For semantic alignment, the text serves as Q and the image features as KV to compute cross-attention maps, and the MSE loss between the two paths' cross-attention maps is optimized. The insight behind our semantic alignment loss is simple: if the embedding of ID does not affect the original model's behavior, then the response of the UNet features to the prompt should be similar in both paths.

  3. For layout alignment, the MSE loss between the two paths' image features is also optimized.

  4. A 4-step generation is used, and the generated images are used to compute the ID loss.

 

InstantID (image + text)

InstantID: Zero-shot Identity-Preserving Generation in Seconds

InstantID

  1. The upper half resembles IP-Adapter, but replaces the CLIP image embedding with a face ID embedding. The authors argue this is not good enough: image tokens and text tokens carry different information and control generation in different ways and strengths, yet IP-Adapter concatenates them, risking mutual domination or impairment.

  2. An additional IdentityNet (ControlNet architecture) is proposed to supply extra spatial information; for the reason above, this ControlNet removes the text cross-attention and keeps only the face ID embedding cross-attention. Only the key points of the eyes, nose, and mouth are provided as input: on a diverse dataset, more keypoints are harder to detect and would dirty the data; fewer keypoints also ease generation and preserve editability with text or other ControlNets.

  3. Self-supervised training on a face dataset.

 

ID-Aligner (image + text)

ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning

ID-Aligner

  1. A general framework to achieve identity preservation via feedback learning.

 

DEADiff (image style + text)

  1. StableDiffusion

  2. A frozen CLIP extracts the reference image's features; the Q-Former's queries cross-attend with the features and the word "content"/"style", and the Q-Former output is fed into StableDiffusion's text cross-attention. New KV projection matrices are trained (Q reuses the text cross-attention's), and the projected Q-Former output is concatenated with the text KV for the computation — a variant of IP-Adapter.

  3. When training with "style", image pairs share the same style but different content, and likewise for "content". Note that inference only uses "style"; "content" during training just makes the extracted style representation more disentangled.

DEADiff

 

Specialist Diffusion (image style + text)

Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style

  1. StableDiffusion

  2. For each style (e.g. Flatten Design, Fantasy, Food doodle), collect dozens of text-image pairs, apply data augmentation, and fine-tune StableDiffusion as that style's specialist diffusion; text input then generates images in that style.

 

VisualStylePrompt (image + text)

Visual Style Prompting with Swapping Self-Attention

During generation, replace the keys and values of all self-attention layers after a certain decoder layer with the keys and values of the corresponding self-attention layers from the reference image's generation.
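The swapping operation can be sketched as self-attention where queries come from the current generation but keys/values come from the reference; a minimal numpy illustration with assumed shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def swapped_self_attention(q_gen, kv_ref):
    """Sketch of swapping self-attention: the generated image keeps
    its own queries, while keys/values are taken from the reference
    image's generation at the same layer, injecting the reference's
    style statistics.

    q_gen: (HW, D) queries of the current generation;
    kv_ref: (HW, D) features from the reference generation."""
    attn = softmax(q_gen @ kv_ref.T / np.sqrt(q_gen.shape[-1]))
    return attn @ kv_ref

out = swapped_self_attention(np.random.randn(6, 4), np.random.randn(6, 4))
assert out.shape == (6, 4)
```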

VisualStylePrompt

 

Prompt-Free Diffusion (image)

Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models

Like ObjectStitch: train a SeeCoder that converts a reference image into CLIP text embeddings, then use it to replace StableDiffusion's CLIP text encoder, generating images from the reference image alone. ControlNet can be used to introduce other conditions.

 

M2M (image sequence)

Many-to-many Image Generation with Auto-regressive Diffusion Models

M2M

  1. Constructs an image sequence dataset.

  2. Each training sample is an image sequence {z0^i} for i = 1..N, noised into {zt^i} and fed into the diffusion model; {zt^i} serve as Q and {z0^i} as KV for cross-attention, with a causal mask so that pixels of zt^i can only attend to pixels of the images z0^{<i}.

 

General

Late-Constraint (sketch/edge/segmentation + text)

Late-Constraint Diffusion Guidance for Controllable Image Synthesis

  1. Introduces various conditions into a pretrained StableDiffusion; effectively an upgraded version of SKG.

  2. Pretrained models extract the image's various conditions (mask, edge, etc.), and a condition adapter is trained to predict the conditions from the UNet's per-layer feature maps.

  3. At sampling time, the current feature maps are fed into the condition adapter to obtain predicted conditions, the distance to the given conditions is computed, and its gradient serves as guidance.

  4. Essentially these methods still train a noisy classifier, but on the diffusion model's features.

 

Readout-Guidance (sketch/edge/pose/depth/drag + text)

Readout Guidance: Learning Control from Diffusion Features

Readout-Guidance-1

Readout-Guidance-2

  1. Similar to Late-Constraint, with two kinds of heads: spatial and relative.

  2. Spatial covers pose, edge, depth, etc.: the head is trained to predict the ground truth from diffusion features; at sampling time an MSE loss between the prediction and the given label is computed and its gradient serves as guidance.

  3. Relative covers correspondence features and appearance similarity, with the head trained to predict from the diffusion features of two different images.

  4. drag: the correspondence feature head uses image pairs with labeled point correspondences and trains a network such that the feature distance between corresponding points is minimized, i.e., the target point feature is the nearest neighbor for a given source point feature. We compute pseudo-labels using a point tracking algorithm to track a grid of query points across the entire video. We randomly select two frames from the same video and a subset of the tracked points that are visible in both frames. During training, the input diffusion features are converted into a feature map and the loss is computed between corresponding point features of the image pair's feature maps. For editing, the source image is fed through the UNet to obtain diffusion features and then the feature map; the distance between the feature at the starting point and the feature at the target point of the generated image's feature map is computed, and its gradient serves as guidance.

 

MCM (segmentation/sketch + text)

Modulating Pretrained Diffusion Models for Multimodal Image

xt, ϵθ(xt), and the conditions y1, …, yn are fed together into the MCM network, which outputs the modulation parameters γt, vt. The modulated prediction ϵt' = ϵθ(xt)·(1 + γt) + vt is used to compute x^0 via Tweedie's formula, and training minimizes an MSE loss.
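The modulation plus Tweedie step can be sketched as follows; a minimal numpy illustration of the assumed form, with a scalar alpha-bar schedule value:

```python
import numpy as np

def mcm_denoise(x_t, eps, gamma_t, v_t, alpha_bar_t):
    """Sketch of MCM-style modulation: the frozen model's noise
    prediction eps is modulated as eps' = eps*(1+gamma) + v, then
    x0_hat is recovered via Tweedie's formula
    x0_hat = (x_t - sqrt(1 - a_bar)*eps') / sqrt(a_bar)."""
    eps_mod = eps * (1 + gamma_t) + v_t
    return (x_t - np.sqrt(1 - alpha_bar_t) * eps_mod) / np.sqrt(alpha_bar_t)

# with gamma = v = 0 the modulation reduces to the plain Tweedie estimate
x_t, eps, ab = np.random.randn(4), np.random.randn(4), 0.5
plain = (x_t - np.sqrt(1 - ab) * eps) / np.sqrt(ab)
assert np.allclose(mcm_denoise(x_t, eps, 0.0, 0.0, ab), plain)
```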

 

Acceptable Swap-Sampling (concept from text)

Amazing Combinatorial Creation Acceptable Swap-Sampling for Text-to-Image Generation

Given two object texts, generate an image fusing the two concepts, similar to MagicMix.

For a 0-1 column-swap vector whose length equals the dimension of the CLIP encoding: where the vector is 0, take that column of the second object text's CLIP encoding; where it is 1, take that column of the first object text's CLIP encoding. The assembled CLIP encoding, fed into StableDiffusion, can generate an image fusing the two concepts.

In practice, sample many column-swap vectors, generate an image for each following the procedure above, and use selection strategies to pick the image that best meets the criteria.

 

SCEdit (keypoints/depth/edge/segmentation + text)

SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing

SCEdit

  1. Achieves fine-tuning or controllable generation by editing the skip-connection features.

  2. SC-Tuner: Oj^SC(x(N−j)) = Tj(x(N−j)) + x(N−j), where j indexes the decoder blocks, N is the number of UNet layers, x(N−j) is the encoder feature paired with the j-th decoder block, and Tj is a trainable Tuner: a down-projection matrix, then GELU, then an up-projection matrix, operating on the channel dimension only. This can be viewed as a counterpart of LoRA — a general fine-tuning method, usable e.g. to adapt the model to a style domain.

  3. CSC-Tuner: Oj^CSC(x(N−j)) = Σ_{m=1..M} αm·(Tj(x(N−j) + cj^m) + cj^m) + x(N−j), where {c^m} for m = 1..M are M conditions such as depth; these conditions are also fed into a trainable hint block producing multi-scale features. This can be viewed as a counterpart of ControlNet.
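The SC-Tuner residual unit can be sketched as follows; a minimal numpy illustration where the hidden width and initialization are assumptions:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

class SCTuner:
    """Sketch of an SC-Tuner: channel-wise down-projection, GELU,
    up-projection, applied residually to a skip-connection feature:
    O(x) = T(x) + x."""
    def __init__(self, channels, hidden, rng=np.random.default_rng(0)):
        self.w_down = rng.normal(0, 0.02, (channels, hidden))
        self.w_up = rng.normal(0, 0.02, (hidden, channels))

    def __call__(self, x):  # x: (..., channels)
        return gelu(x @ self.w_down) @ self.w_up + x

x = np.random.randn(16, 8)          # 16 spatial positions, 8 channels
tuner = SCTuner(channels=8, hidden=2)
assert tuner(x).shape == x.shape    # residual unit preserves the shape
```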

 

GLIGEN (image entity/image style/bounding box/keypoints/depth/edge/segmentation + text)

GLIGEN: Open-Set Grounded Text-to-Image Generation

GLIGEN

  1. StableDiffusion

  2. Besides the caption, a set of entities with corresponding grounding information (e.g. a layout) is given for spatial control.

  3. A trainable gated self-attention layer is added between the self-attention and cross-attention layers: the grounding tokens and visual tokens are concatenated and passed through self-attention, only the visual-token positions of the output are kept, multiplied by a trainable gate scalar, and added residually. The gate scalar is initialized to 0, like ControlNet's zero-conv, ensuring the network initially behaves exactly like StableDiffusion.

  4. A grounding token is predicted by a trainable MLP from an entity and its grounding feature. The entity can be text or an image, encoded by the corresponding pretrained text or image encoder; the grounding uses a Fourier embedding — the top-left and bottom-right coordinates for a layout, a single coordinate for a keypoint. For a depth map there is no entity: a network converts the map into h×w tokens, and the downsampled depth map is also concatenated onto the input, training StableDiffusion's first convolution layer.
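The gated self-attention layer can be sketched as follows; a minimal numpy illustration in which the tanh gating is an assumption:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def gated_self_attention(visual, grounding, gate):
    """Sketch of GLIGEN-style gated self-attention: visual and
    grounding tokens are concatenated for self-attention, only the
    visual positions of the output are kept, scaled by a trainable
    gate (initialized to 0), and added residually.

    visual: (V, D), grounding: (G, D), gate: scalar."""
    x = np.concatenate([visual, grounding])          # (V+G, D)
    attn = softmax(x @ x.T / np.sqrt(x.shape[-1]))
    out = (attn @ x)[: len(visual)]                  # keep visual positions
    return visual + np.tanh(gate) * out

v, g = np.random.randn(4, 8), np.random.randn(2, 8)
# gate initialized to 0: the layer is an identity at the start of training
assert np.allclose(gated_self_attention(v, g, 0.0), v)
```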

 

ReGround (image entity/image style/bounding box/keypoints/depth/edge/segmentation + text)

ReGround: Improving Textual and Spatial Grounding at No Cost

GLIGEN

  1. Recasts GLIGEN into an IP-Adapter-style parallel attention form. No retraining is needed: directly converting a trained GLIGEN into the ReGround form already improves results.

 

InteractDiffusion (interaction + text)

InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models

An interaction is defined as a triplet of subject, action, and object, each with a text description and a bounding box. Subject and object share one MLP that converts the text (pretrained text encoding) and bounding box (Fourier embedding) into a token; the action uses another MLP to produce its token.

If an image contains multiple interactions, the interactions cannot be told apart, so each interaction gets a trainable embedding, similar to a positional embedding. Likewise, the three elements of a triplet cannot be told apart, so each of the three roles gets a trainable embedding, shared across all interactions.

With the final embeddings obtained, training proceeds as in GLIGEN.

InteractDiffusion

 

InstDiff (box/mask/scribble/point + text)

InstanceDiffusion: Instance-level Control for Image Generation

InstDiff

 

ControlNet (edge/segmentation/keypoints + text)

Adding Conditional Control to Text-to-Image Diffusion Models

  1. Adds a PDAE-style conditional module, ControlNet, to a pretrained StableDiffusion.

  2. ControlNet: freeze StableDiffusion, make a trainable copy of every block of the UNet encoder and middle block, and add its outputs to the outputs of the corresponding UNet decoder blocks. A zero convolution is a 1x1 convolution with all parameters initialized to 0, so the whole trainable copy outputs 0 before training and does not affect the original network.

  3. The condition usually has the same size as the original image. Since it must be added to the original network's input, it must match that input's size; StableDiffusion's input is a downsampled latent, so the condition also needs dimensionality reduction, requiring an extra trained encoder to encode and downsample it.

  4. Multiple ControlNets can be combined.

  5. StableDiffusion generally needs classifier-free guidance to generate good images. ControlNet can then be applied to both the unconditional and conditional predictions, or only to the conditional one. When generating without a prompt, applying ControlNet to both makes cfg degenerate and results suffer, while applying it only to the conditional prediction makes the guidance too strong; the proposed solution is resolution weighting.
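The zero-convolution trick can be sketched as a per-pixel linear map with zero-initialized weights; a minimal numpy illustration:

```python
import numpy as np

def zero_conv(x, w, b):
    """A 1x1 convolution expressed as a per-pixel linear map.
    ControlNet initializes w and b to zero so the trainable branch
    contributes nothing before training. x: (H, W, C_in), w: (C_in, C_out)."""
    return x @ w + b

H, W, C = 4, 4, 8
feat = np.random.randn(H, W, C)           # trainable-copy feature
w0, b0 = np.zeros((C, C)), np.zeros(C)    # zero initialization
decoder_out = np.random.randn(H, W, C)
# before training, adding the branch leaves the decoder output unchanged
assert np.allclose(decoder_out + zero_conv(feat, w0, b0), decoder_out)
```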

ControlNet1

ControlNet2

 

ControlNet-XS

ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

ControlNet-XS

  1. ControlNet suffers from an information delay: at a given denoising step, the SD encoder does not know the control information and the ControlNet encoder does not know the generative information.

  2. ControlNet-XS synchronizes information between the two encoders: one's feature map passes through a trainable convolution and is added to the other's, and vice versa. The ControlNet encoder then no longer needs to copy the SD encoder; a much smaller network handling same-sized feature maps, randomly initialized, suffices and even outperforms ControlNet.

 

FineControlNet

  1. StableDiffusion + ControlNet

  2. training-free

  3. Separates multi-instance inputs and modifies the cross-attention: each instance goes through cross-attention once, and the outputs of all instances are summed into the final output. This operates on UNet features, so in the UNet encoder only text information is fused, while in the UNet decoder both control and text information are fused.

FineControlNet

 

SmartControl

SmartControl: Enhancing ControlNet for Handling Rough Visual Conditions

SmartControl

  1. Relax the visual condition on the areas that conflict with the text prompt. For example, when generating a tiger from a deer's depth map, the antler region must be discarded.

  2. ControlNet can scale the condition strength with an α: h^(i+1) = Di(h^i + α·h_cond^i). Reducing α also reduces how well samples match the condition, but it is unstable, and each visual condition needs a carefully hand-picked α. Based on this, a relaxed-alignment dataset is constructed and SmartControl is then trained on it.

 

ControlNet++

ControlNet++: Improving Conditional Controls with Efficient Consistency Feedback

ControlNet++-1

ControlNet++-2

  1. Add noise, denoise one step, and fine-tune using x^0.

 

X-Adapter

X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

X-Adapter

  1. train a universal compatible adapter so that plugins of the base stable diffusion model (such as ControlNet on SD) can be directly utilized in the upgraded diffusion model (such as SDXL).

  2. Train a mapper that maps the base model decoder's features into the feature dimensions of the upgraded model's decoder and adds them on; the mapper is trained with the upgraded model's diffusion loss. Note that during training, the upgraded model receives an empty prompt.

 

Ctrl-Adapter

Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

Ctrl-Adapter

  1. Similar to X-Adapter. Pretrained ControlNets cannot be directly plugged into new backbone models due to the mismatch of feature spaces, and the cost of training ControlNets for new backbones is a big burden for many users.

 

MGPF

Enhancing Prompt Following with Visual Control Through Training-Free Mask-Guided Diffusion

  1. Targets the problem where visual controls are misaligned with text prompts: for example, the prompt mentions an object that has no corresponding edge in the visual control, so the image generated with ControlNet loses that object.

  2. This is essentially ControlNet dominating the generation, so a training-free method is proposed: extract a mask from each object's edges, combine all masks, and multiply ControlNet's features by the mask before adding them to the UNet decoder's features, so that ControlNet is only responsible for generating objects that have visual controls. Our experimental results show that the application of masks to ControlNet features substantially mitigates conflicts between mismatched textual and visual controls, effectively addressing the problem of object missing in generated images.

  3. For the attribute-binding problem, compute the overlap between the attribute's and the object's cross-attention maps and optimize zt by gradient descent.

 

CNC (depth/image/depth and image + text)

Compose and Conquer: Diffusion-Based 3D Depth Aware Composable Image Synthesis

CNC-1

CNC-2

  1. StableDiffusion + ControlNet

  2. Self-supervised training: for an image, extract the salient object's mask; the image multiplied by the mask gives the foreground image, and the image multiplied by the mask's complement, with the salient object region inpainted, gives the background image. Depth is extracted for both the foreground and background images.

  3. CLIP image embeddings of the foreground and background images pass through a network and are concatenated after the text embedding; in ControlNet's cross-attention layers the mask is applied so that Q and the foreground K only have values inside the mask region, while Q and the background K only have values outside it.

  4. Foreground and background are asymmetric: swapping their inputs generates images with different spatial relations, hence "3D depth aware".

 

FreeControl (keypoints/depth/edge/segmentation/mesh + text)

FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition

  1. StableDiffusion

  2. training-free

  3. During DDIM Inversion, the feature before the first self-attention of the UNet decoder (query, key, value) is C×H×W, viewed as H×W vectors of length C. After PCA, the first three components form a basis, and the feature's coordinates in this basis (3×H×W), plotted as an image, exhibit segmentation structure. DDIM Inversion of images of the same concept in different modalities behaves the same way, and the bases derived from different modalities of the same concept are interchangeable.

FreeControl1

  1. Exploiting this property: first generate some images of the target concept to obtain N×C×H×W features, viewed as N×H×W vectors of length C; run PCA and take the basis. During generation, a loss pulls the generated image's feature coordinates in this basis toward the condition feature's coordinates in the same basis, and its gradient serves as guidance. The idea is similar to Late-Constraint, but training-free.

FreeControl2

 

T2I-Adapter (edge/segmentation/keypoints + text)

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

T2I-Adapter

  1. Feature maps of matching size computed from the condition are added to the feature maps output by the pretrained StableDiffusion encoder at each resolution; only the T2I-Adapter is optimized.

 

BTC (sketch/depth/pose + text)

Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples

BTC

  1. No text is needed at training time, and only tens to hundreds of samples are required.

  2. Similar to T2I-Adapter: train a prompt-free condition encoder whose output feature maps are added to the StableDiffusion encoder's feature maps at each resolution. The prompt-free condition encoder is copied from the StableDiffusion encoder with the cross-attention layers removed; each scale's feature map passes through an extra zero convolution layer.

 

DiffBlender (sketch/depth/edge/box/keypoints/color + text)

DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

  1. StableDiffusion

  2. Trainable local self-attention and global self-attention layers are inserted between the self-attention and cross-attention layers for multimodal training.

DiffBlender

 

Universal Guidance (segmentation/detection/face recognition/style + text)

Universal Guidance for Diffusion Models

  1. StableDiffusion

  2. Forward guidance: use Tweedie's formula to compute x̂_0 from x_t and ϵ_θ(x_t, t), feed it into off-the-shelf segmentation/detection/face-recognition/style models to compute a loss, and use its gradient as guidance.

  3. Backward guidance: on top of the above guidance, use Decomposed Diffusion Sampling to optimize a Δx_0 for further guidance.

  4. At every sampling step, forward guidance + backward guidance are repeated multiple times with the resample technique.
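The forward-guidance step above hinges on Tweedie's formula, which recovers a clean-image estimate from the noisy latent and the predicted noise. A minimal numpy sketch (function name is my own; `alpha_bar_t` denotes ᾱ_t from the standard DDPM forward process x_t = √ᾱ_t·x_0 + √(1−ᾱ_t)·ϵ):

```python
import numpy as np

def tweedie_x0(x_t, eps_pred, alpha_bar_t):
    """Estimate x0 via Tweedie's formula:
    x0_hat = (x_t - sqrt(1 - alpha_bar_t) * eps_pred) / sqrt(alpha_bar_t)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
```

The resulting x̂_0 is what gets passed to the off-the-shelf recognition models; the loss gradient with respect to x_t then acts as guidance.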

 

Multi-Modality

Composer (shape/semantics/sketch/masking/style/content/intensity/palette/text)

Composer: Creative and Controllable Image Synthesis with Composable Conditions

  1. Various pretrained networks extract structural, semantic, and feature information from images, which then condition the training of GLIDE.

  2. Training trick: drop all conditions with probability 0.1, keep all conditions with probability 0.7; otherwise each condition is independently dropped with probability 0.5.
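The condition-dropout trick above can be sketched as a per-sample mask generator (a minimal sketch with stdlib `random`; the function name and dict return type are my own):

```python
import random

def sample_condition_mask(cond_names, p_drop_all=0.1, p_keep_all=0.7,
                          p_drop_each=0.5, rng=random):
    """Return {name: keep?} for one training example:
    with prob p_drop_all drop everything, with prob p_keep_all keep everything,
    otherwise drop each condition independently with prob p_drop_each."""
    u = rng.random()
    if u < p_drop_all:
        return {n: False for n in cond_names}
    if u < p_drop_all + p_keep_all:
        return {n: True for n in cond_names}
    return {n: rng.random() >= p_drop_each for n in cond_names}
```

Dropping all conditions gives the unconditional branch needed for classifier-free guidance, while the independent-drop branch teaches the model to compose arbitrary condition subsets.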

 

MaxFusion

MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models

MaxFusion

  1. A single ControlNet is trained to take inputs from different modalities; the different tasks in the figure share the same network.

  2. After each layer, the features from the different modalities are merged and skip-connected to the UNet decoder; the merged feature is then unmerged back into the original number of streams and fed to the next layer.

  3. Merge strategy: at each spatial location, compute the correlation between the two features. If it exceeds a preset threshold, take their average; otherwise, compute each feature's standard deviation relative to its own full feature map and keep the one with the larger standard deviation.

  4. Baselines are Multi-T2I-Adapter and Multi-ControlNet, i.e., a separate T2I-Adapter or ControlNet trained per task and used together.
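The merge strategy in item 3 can be sketched roughly as follows (a numpy sketch under my own interpretation: "correlation" as per-location channel-wise cosine similarity, and "standard deviation relative to its own feature" as the location's vector norm in units of its map-wide std; the paper's exact definitions may differ):

```python
import numpy as np

def merge_features(fa, fb, tau=0.8):
    """fa, fb: (C, H, W) feature maps from two modalities.
    Per spatial location: if channel-wise cosine similarity > tau, average;
    otherwise keep the feature that deviates more from its own map."""
    num = (fa * fb).sum(axis=0)
    den = np.linalg.norm(fa, axis=0) * np.linalg.norm(fb, axis=0) + 1e-8
    corr = num / den                                   # (H, W)
    # saliency: per-location vector norm scaled by the map's overall std
    sa = np.linalg.norm(fa, axis=0) / (fa.std() + 1e-8)
    sb = np.linalg.norm(fb, axis=0) / (fb.std() + 1e-8)
    avg = 0.5 * (fa + fb)
    chosen = np.where((sa >= sb)[None], fa, fb)
    return np.where((corr > tau)[None], avg, chosen)
```

When both modalities agree at a location the average keeps both signals; where they conflict, the more salient one wins.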

 

OmniControlNet

OmniControlNet: Dual-stage Integration for Conditional Image Generation

OmniControlNet

  1. First learn a pseudo word for each modality, e.g., use a few depth-map images and the prompt "use <depth> as feature" with TI to learn the word embedding of "<depth>".

  2. Then train a ControlNet on the different modalities, prepending the modality's phrase "use <depth> as feature" to the prompt of the trainable copy, so that a single ControlNet can handle conditions of different modalities.

 

gControlNet

Cocktail: Mixing Multi-Modality Controls for Text-Conditional Image Generation

  1. Conditions from multiple modalities are fused and fed into a single ControlNet for training, enabling generation from any combination of modality conditions.

 

Uni-ControlNet

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

Uni-ControlNet-1

Uni-ControlNet-2

 

FaceComposer

FaceComposer: A Unified Model for Versatile Facial Content Creation

  1. Similar to Composer but specialized for faces; also supports talking-face generation.

 

Any-to-Any

Versatile Diffusion

Versatile Diffusion: Text, Images and Variations All in One Diffusion Model

Multi-Flow

VersatileDiffusion

 

Multi-Source

Multi-Source Diffusion Models for Simultaneous Music Generation and Separation

Multiple music sources are concatenated and trained together; during training all sources share the same timestep but use different noise.

total generation

partial generation: blended inpainting, e.g., generating an accompaniment.

source separation: treat the source to be separated as the sum of all sources minus the sum of the other sources.

 

UniDiffuser

One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale

Pretrained encoders convert both image and text into tokens; two additional decoders are trained to reconstruct the image and text from tokens.

Text and image are trained jointly with a U-ViT architecture; the two modalities are given different timesteps and noise during training, which enables unconditional (the other modality is always fed noise), conditional (the other modality is always fed the condition), and joint (synchronized generation) sampling.

 

EasyGen

Making Multimodal Generation Easier When Diffusion Models Meet LLMs

  1. BiDiffuser: fine-tune UniDiffuser to perform only image-to-text and text-to-image, with loss $\|\epsilon^x-\epsilon_\theta^x(x_{t_x},y_0,t_x,0)\|_2^2+\|\epsilon^y-\epsilon_\theta^y(x_0,y_{t_y},0,t_y)\|_2^2$, i.e., the joint distribution is no longer trained.

  2. Couple BiDiffuser with an LLM.

EasyGen

 

CoDi

Any-to-Any Generation via Composable Diffusion

  1. 目标:generate any combination of output modalities from any combination of input modalities.

  2. "We begin with a pretrained text-image paired encoder, i.e., CLIP. We then train audio and video prompt encoders on audio-text and video-text paired datasets using contrastive learning, with text and image encoder weights frozen." This gives each modality an encoder whose outputs share a common embedding space; a diffusion model is then trained per modality conditioned on its encoding.

  3. The above yields single-modality diffusion models that can only generate one-to-one, not many-to-many. Using text-image data, a new cross-attention layer is added to the UNets of the text diffusion model and the image diffusion model, and only this cross-attention layer is trained. The cross-attention works by designing an independent encoder for each modality's noisy latent, embedding the different modalities' noisy latents into a common embedding space, and attending to these embedding tokens; besides the diffusion loss, contrastive learning is also used, so the text and image noisy latents become aligned through their encoders. Then, with the text encoder and cross-attention weights frozen, the same procedure on text-audio data trains the audio encoder and cross-attention weights; next, with the audio encoder and cross-attention weights frozen, the same procedure on audio-video data trains the video encoder and cross-attention weights. In this way the noisy latents of all four modalities are aligned in cross-attention, and the encoder embeddings of different noisy latents can be interpolated for joint sampling, even for combinations never seen in training.

 

CoDi-2

CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation

For multimodal data, CoDi's multimodal encoder is reused: the encodings (feature sequences) of the non-text modalities are fed into the LLM for training; the output (feature sequence) is regressed and simultaneously fed into the corresponding modality's diffusion model to compute a diffusion loss, and the two losses are trained together.

Text is still trained with a token-prediction loss.

In essence this is still feature-based rather than token-based.

 

GlueGen

GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation

For corpora of other modalities (e.g., speech, foreign languages), an encoder network is learned whose output (distribution) aligns with that of StableDiffusion's existing text encoder.

This allows seamless switching: the trained encoder provides the cross-attention KV for StableDiffusion, enabling generation from different modalities.

No fine-tuning of StableDiffusion is needed; moreover, fine-tuning would cause forgetting of previously supported modalities.

 

In-Context/Prompt/Instruction

UniControl

UniControl: A Unified Diffusion Model for Controllable Visual Generation In the Wild

Mimics InstructGPT to train a StableDiffusion that generates according to instructions.

Different tasks are organized into a unified form; each task contains a task instruction (e.g., "segmentation to image"), a prompt, a visual condition (segmentation), and a target image. Training uses the ControlNet architecture: the prompt goes into StableDiffusion while the task instruction and visual condition go into ControlNet, and multiple tasks are trained jointly. The model generalizes to zero-shot tasks and zero-shot task combinations (e.g., segmentation + skeleton to image).

 

PromptDiffusion

In-Context Learning Unlocked for Diffusion Models

Improving In-Context Learning in Diffusion Models with Visual Context-Modulated Prompts

A prompt consists of an example pair and a text. The example pair consists of a query image (e.g., segmentation, edge map) and its corresponding real image; given a new query image, the model must generate an aligned image according to the example pair and the text.

The trained model also works on unseen example pairs, i.e., In-Context Learning (a learning framework requiring no training).

The architecture is identical to ControlNet; only the input condition becomes the combination of the example pair and the new query image.

 

ContextDiffusion

Context Diffusion: In-Context Aware Image Generation

ContextDiffusion

 

ImageBrush

ImageBrush: Learning Visual In-Context Instructions for Exemplar-Based Image Manipulation

ImageBrush

  1. The same In-Context Learning as PromptDiffusion: an example pair + query image + target image form a 2×2 grid as training data; the example pair and query image are kept fixed while diffusion is trained to generate the target image.

 

InstructGIE

InstructGIE: Towards Generalizable Image Editing

InstructGIE

  1. Similar to ImageBrush.

 

Analogist

Analogist: Out-of-the-box Visual In-Context Learning with Image Diffusion Model

Analogist

  1. Similar to ImageBrush.

 

Human/Hand

HumanSD

HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation

HumanSD-1

HumanSD-2

  1. The skeleton is also encoded by the VAE encoder and concatenated onto z_t.

  2. Feeding ϵ_θ(z_t, t, c) − ϵ into a pre-trained human-pose heatmap estimator yields a heat map, which is used as the weight of the diffusion loss. The idea resembles PDAE's gradient estimator, with ϵ_θ(z_t, t, c) − ϵ playing the role of the estimator's output.

 

Parts2Whole

From Parts to Whole: A Unified Reference Framework for Controllable Human Image Generation

Parts2Whole

  1. The Appearance Encoder's input is not noised; each part image is fed independently to provide reference features, and the input text is that part image's category, e.g., face, hair.

  2. Shared Self-Attention follows the idea of GLIGEN: after self-attention, only the image features are kept. If a part image has a mask, attention attends only to the unmasked pixels.

  3. Decoupled Cross-Attention is IP-Adapter: two parallel cross-attention layers separately process the text and the part images.

 

HandRefiner

HandRefiner: Refining Malformed Hands in Generated Images by Diffusion-based Conditional Inpainting

HandRefiner

  1. hand depth map + ControlNet

 

Hand2Diffusion

Giving a Hand to Diffusion Models: a Two-Stage Approach to Improving Conditional Human Image Generation

Hand2Diffusion

  1. Generate the hands first, then the body.

 

HanDiffuser

HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances

HanDiffuser

  1. Generation is mediated by hand params as an intermediate representation.

 

RHanDS

RHanDS: Refining Malformed Hands for Generated Images with Decoupled Structure and Style Guidance

RHanDS-1

RHanDS-2 RHanDS-3

 

  1. Cut the malformed hand out of the original image, feed it into RHanDS for refinement, then paste it back.

  2. RHanDS is trained in two stages. The first stage builds a dataset (the two hands of the same person form a data pair) to train style preservation; the second stage uses a 3D model to extract a mesh and trains reconstruction from structure. The 3D model can also extract a normal hand's mesh from a malformed hand.

 

Text/Glyph

TextDiffuser

TextDiffuser: Diffusion Models as Text Painters

  1. Generates images containing text.

  2. First a Transformer is trained to generate the text layout, then a diffusion model conditioned on the layout's mask is trained to generate the image.

 

TextDiffuser-2

TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering

  1. An LLM is trained to perform layout planning for the text rendering; a diffusion model is then trained to generate according to the layout plan.

 

CustomText

CustomText: Customized Textual Image Generation using Diffusion Models

CustomText

 

GlyphControl

GlyphControl: Glyph Conditional Control for Visual Text Generation

Self-supervised training: an OCR model recognizes the text in images containing text, and the recognized text is fed into a ControlNet trained to reconstruct the original image.

 

GlyphDraw

GlyphDraw: Learning to Draw Chinese Characters in Image Synthesis Models Coherently

All conditions are fed into the UNet, which is retrained.

 

AnyText

AnyText: Multilingual Visual Text Generation And Editing

AnyText

 

UDiffText

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

UDiffText

 

Brush Your Text

Brush Your Text: Synthesize Any Scene Text on Images via Diffusion Model

ControlNet + cross-attention mask constraint

 

LTOS

LTOS: Layout-controllable Text-Object Synthesis via Adaptive Cross-attention Fusions

LTOS

  1. The object-layout control module is implemented with GLIGEN.

  2. The visual-text rendering module is implemented with ControlNet (on top of GLIGEN). Similar to how ControlNet-XS addresses the information-delay problem, to let the layout interact with the glyph information, the skip features undergo cross-attention with the backbone features before the skip-connection.

 

TextCenGen

TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

TextCenGen

  1. training-free.

 

 

Image Composition

Collage Diffusion

Collage Diffusion

Pastes different collages together while ensuring harmonization (no overlap).

TI encodes each collage into a text embedding, and StableDiffusion's cross-attention is modified, introducing mask information as in MaskDiffusion; the two are trained together.

At generation time, a mask is applied to the cross-attention map of each collage's pseudo word.

 

Diff-Harmonization

Zero-Shot Image Harmonization with Generative Model Prior

Given a composite image, our method can achieve its harmonized result, where the color space of the foreground is aligned with that of the background.

To achieve image harmonization, we can leverage a word whose attention is mainly constrained to the foreground area of the composite image, and replace it with another word that can illustrate the background environment.

 

RecDiffusion

RecDiffusion: Rectangling for Image Stitching with Diffusion Models

RecDiffusion

  1. task:rectangling

 

PrimeComposer

PrimeComposer: Faster Progressively Combined Diffusion for Image Composition with Attention Steering

PrimeComposer

  1. pixel composition: stitch directly according to the mask.

  2. correlation diffuser: the self-attention KV from the object's inversion process replace the KV of pixel composition's self-attention layers, but only at positions inside M_obj.

  3. RCA: restrict the object's cross-attention to inside the mask; responses outside the mask are set to negative infinity.

  4. At every step the latent is pixel-composited again with the latent from the background's inversion process, to preserve the background.

 

Diffusion in Diffusion

Diffusion in Diffusion: Cyclic One-Way Diffusion for Text-Vision-Conditioned Generation

 

TF-ICON

TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition

Injects a reference image into a main image so that it matches the main image's style.

Exceptional inversion encodes both images to noise; the reference image's inverted noise is resized and injected into the main image's inverted noise before generating.

 

Composite Diffusion

Composite Diffusion

scaffolding stage: generate from the condition up to some intermediate step, yielding only a rough structure.

harmonization: text-guided generation, or blended (if a segmentation condition is given).

 

LRDiff

Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

vision guidance: add (2M−1)δ to x_t, where M is a mask of the same size as the image specifying the object region and δ is a scalar. "For the region containing an object, we add δ to enhance the generation tendency of that object. Conversely, for areas outside the target region, we subtract δ to suppress the generation tendency of the object." The value of δ can be obtained by averaging x_t over the high-response region of the cross-attention map.
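The (2M−1)δ shift is a one-liner; a minimal numpy sketch (function name is my own):

```python
import numpy as np

def apply_vision_guidance(x_t, mask, delta):
    """Add +delta inside the object mask and -delta outside: x_t + (2M - 1) * delta."""
    return x_t + (2.0 * mask - 1.0) * delta
```

With a binary mask M, pixels inside the target region are nudged by +δ and all others by −δ, biasing where the object forms.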

LRDiff

 

Make-A-Storyboard

Make-A-Storyboard: A General Framework for Storyboard with Disentangled and Merged Control

Make-A-Storyboard

  1. TI learns the concept and the scene separately; directly composing a sentence with concept + scene generates poorly. Instead, first generate with the concept and extract a mask. Then generate separately with the concept and with the scene; at some step λ, fuse the two x_λ using the mask, and afterwards alternate generation, one step with the concept-only sentence and one step with the scene-only sentence.

 

 

AnyScene

AnyScene: Customized Image Synthesis with Composited Foreground

AnyScene

  1. The Foreground Injection Module is a ControlNet architecture trained in a self-supervised manner.

 

Image Editing through Text

Summarization

MDP

MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path

A general framework, similar to a survey.

 

Mask-Based

Besides the text, the region to be edited must be specified. Editing uses text-guided inpainting, keeping the unmasked part unchanged; see the Inpainting section.

 

IIE

Guided Image Synthesis via Initial Image Editing in Diffusion Model

For unsatisfactory parts of a generated image, the corresponding region of x_T can be re-randomized; an object's position in the generated image can also be changed by moving the region of the initial noise that corresponds to it.

 

MaSaFusion

Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion

MaSaFusion

 

Mask-Free

The difficulty lies in keeping the background and all other non-edited content consistent with the original image.

 

Text-Guided SDEdit

baseline

 

LASPA (real image editing, no fine-tune)

LASPA: Latent Spatial Alignment for Fast Training-free Single Image Editing

LASPA

  1. The most direct way to preserve the original image's details is to inject it into the generation process. SDEdit amounts to a single-step injection; LASPA injects at every step, using simple interpolation.

 

LaF (real image editing, no fine-tune)

Text Guided Image Editing with Automatic Concept Locating and Forgetting

LaF-1

LaF-2

  1. Text-Guided SDEdit

  2. Text-Guided SDEdit constrains the edited concept to the original image (e.g., its shape). Therefore a syntactic parser extracts the concept c_n to forget, and sampling uses the CFG combination ϵ_θ(x_t) + ω(ϵ_θ(x_t, c_p) − ϵ_θ(x_t)) − η(ϵ_θ(x_t, c_n) − ϵ_θ(x_t)).
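The guidance combination with the forgetting term can be sketched directly (a minimal sketch; the function name and scalar inputs are my own, in practice these are noise-prediction tensors):

```python
def laf_noise(eps_uncond, eps_pos, eps_neg, omega, eta):
    """CFG with a negative 'forget' direction:
    eps_uncond + omega*(eps_pos - eps_uncond) - eta*(eps_neg - eps_uncond)."""
    return eps_uncond + omega * (eps_pos - eps_uncond) - eta * (eps_neg - eps_uncond)
```

The ω term pushes toward the target prompt c_p while the η term pushes away from the forgotten concept c_n.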

 

P2P (generated image editing, real image editing, no fine-tune)

Prompt-to-Prompt Image Editing with Cross Attention Control

P2P

  1. Imagen

  2. The structure of an image generated by a text2img model is determined mainly by the random seed and the cross-attention; keeping the seed fixed (with DDIM, fixing the starting noise) and manipulating the cross-attention achieves content preservation.

  3. This method does not edit an existing image; it starts from Gaussian noise and generates two images in parallel, one from the source prompt and one from the target prompt (the original image is unknown before the program runs). This amounts to two parallel trajectories, a reconstruction generative trajectory using the source prompt and an editing generative trajectory using the target prompt; the former provides cross-attention maps that the latter uses to modify its own cross-attention maps and achieve the edit.

  4. The manipulation targets the cross-attention part of the 16x16-resolution hybrid-attention in Imagen's text 64x64 model; the self-attention part is untouched, and Imagen's original super-resolution models are reused.

  5. KV become visual token + target prompt token; the cross-attention map computed from the new QK is manipulated in three main ways. Word swap: except for the swapped word, use the original cross-attention map everywhere. Adding a new phrase: the old phrase's portion uses the original cross-attention map. Attention re-weighting: multiply the original cross-attention map's entries for the word to be strengthened/weakened by a constant factor.

  6. The above is generated-image editing; real-image editing additionally requires DDIM Inversion. First invert the original image with the source prompt, and from the resulting x_T perform the same editing with the target prompt. Because editing runs two parallel generative trajectories (reconstruction and editing), they must use the same ω, and editing needs a fairly large ω to work well, so reconstruction must also use a large ω, which in turn requires DDIM Inversion to use a small ω (see the Inversion section). Even so, the reconstruction quality falls short of using a small ω for both DDIM Inversion and reconstruction, and the edited image's background deviates noticeably from the original, which then requires lowering the editing trajectory's ω. This is the distortion-editability tradeoff (with a small ω the background is preserved well but the edit is weak; with a large ω the edit is strong but the background is not preserved). The paper offers a fine-grained remedy: threshold the attention map of the word to be edited in the user-provided source caption to generate a mask; this mask protects the region outside the edited word during blended generation.
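The cross-attention manipulations in item 5 reduce to simple array operations on the (queries × text-tokens) attention map. A minimal numpy sketch of re-weighting and word swap (function names are my own; real implementations hook these into the attention layers):

```python
import numpy as np

def reweight_attention(attn, token_index, scale):
    """attn: (num_queries, num_tokens) cross-attention map.
    Scale one token's column to strengthen/weaken that word."""
    out = attn.copy()
    out[:, token_index] *= scale
    return out

def word_swap(attn_target, attn_source, swapped):
    """Use the target trajectory's map only for swapped tokens;
    keep the source (reconstruction) map for all other tokens."""
    swapped = np.asarray(swapped)
    return np.where(swapped[None, :], attn_target, attn_source)
```

"Adding a new phrase" is the same `np.where` pattern with the old-phrase token positions mapped onto the new prompt's token indices.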

 

 

NTI (real image editing)

Null-text Inversion for Editing Real Images using Guided Diffusion Models

  1. StableDiffusion

  2. Addresses the problem in P2P's real-image editing where inverting with a small ω and then editing with a large ω yields poor reconstruction.

  3. First run DDIM Inversion with ω=1 (requires the source prompt) and record all z_t. Then assign each z_t a null-text embedding ϕ_t. Initialize ẑ_T = z_T; for t = T down to 1, use ẑ_t, the source prompt, ϕ_t, and ω=7.5 to generate a candidate z_{t−1}, and optimize only ϕ_t to pull it toward the recorded z_{t−1}. Once optimized, use ẑ_t and ϕ_t with ω=7.5 to generate ẑ_{t−1} and move to the next step. This guarantees that at ω=7.5, with the help of ϕ_t, the image can be reconstructed.

  4. For editing, start from ẑ_T = z_T and generate with the target prompt, the trained {ϕ_t}_{t=0}^T, and ω=7.5. This can be done in P2P mode or not.
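The per-timestep null-text optimization above can be illustrated with a toy scalar model, where the DDIM step is abstracted as z_{t−1} = a·z_t + b·ϵ and the noise model is linear, so the gradient of the prediction with respect to ϕ is available in closed form (everything here — the toy `eps_model`, coefficients `a`, `b`, the hand-derived gradient — is an assumption for illustration, not the paper's implementation):

```python
import numpy as np

def cfg_eps(z, cond, phi, omega, eps_model):
    # classifier-free guidance with a learnable null embedding phi
    return eps_model(z, phi) + omega * (eps_model(z, cond) - eps_model(z, phi))

def ddim_step(z, eps, a, b):
    # toy DDIM update z_{t-1} = a*z + b*eps (coefficients abstracted)
    return a * z + b * eps

def optimize_null_text(z_traj, cond, omega, eps_model,
                       a=0.98, b=-0.1, steps=200, lr=0.5):
    """Fit a per-timestep null embedding phi_t so that CFG sampling at
    guidance scale omega reproduces the recorded trajectory z_traj = [z_T..z_0]."""
    T = len(z_traj) - 1
    phis, z_hat = [], z_traj[0]
    for t in range(T):
        phi, target = np.zeros_like(cond), z_traj[t + 1]
        for _ in range(steps):
            pred = ddim_step(z_hat, cfg_eps(z_hat, cond, phi, omega, eps_model), a, b)
            # toy eps_model is linear in phi, so d(pred)/d(phi) = b * (1 - omega)
            phi = phi - lr * 2 * (pred - target) * b * (1 - omega)
        phis.append(phi)
        z_hat = ddim_step(z_hat, cfg_eps(z_hat, cond, phi, omega, eps_model), a, b)
    return phis, z_hat
```

The structure mirrors NTI: optimize ϕ_t against the recorded latent, commit the step with the optimized ϕ_t, then proceed to t−1.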

 

PTI (real image editing)

Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models

  1. StableDiffusion

  2. No source prompt is needed, so DDIM Inversion can only use ω=0.

  3. Similar to Null-text Inversion: first run DDIM Inversion with ω=0 and record all z_t, then assign each z_t a conditional embedding c_t. Initialize ẑ_T = z_T; for t = T down to 1, use ẑ_t and c_t with ω=7.5 to generate a candidate z_{t−1}, and optimize only c_t to pull it toward the recorded z_{t−1}. Once optimized, use ẑ_t and c_t with ω=7.5 to generate ẑ_{t−1} and move on. This guarantees reconstruction at ω=7.5 with the help of c_t.

  4. Editing is non-P2P: start from z_T, use ω=7.5 and c = η·c_target + (1−η)·c_t with η ∈ [0,1]; η=0 reconstructs the image, η=1 is plain DDIM Edit.

 

BARET (real image editing)

BARET: Balanced Attention based Real image Editing driven by Target-text Inversion

Similar to Prompt Tuning Inversion, but c_t is initialized with the target prompt embedding before optimization.

 

NPI (real image editing)

Negative-prompt Inversion: Fast Image Inversion for Editing with Text-guided Diffusion Models

Instead of optimizing ϕ as in Null-text Inversion, ϕ is directly replaced with the source prompt embedding c.

Then during DDIM Inversion and reconstruction, regardless of ω, ϵ̃(x_t, t) = ϵ_θ(x_t, t, c) + ω[ϵ_θ(x_t, t, c) − ϵ_θ(x_t, t, c)] = ϵ_θ(x_t, t, c), guaranteeing reconstruction quality.

For editing, the source prompt is used as the negative prompt.
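The identity above is just the CFG formula collapsing when both branches receive the same embedding; a minimal sketch:

```python
def cfg(eps_uncond, eps_cond, omega):
    """Classifier-free guidance: eps_uncond + omega * (eps_cond - eps_uncond)."""
    return eps_uncond + omega * (eps_cond - eps_uncond)

# NPI's point: if the "null" branch is fed the source prompt itself,
# the guidance term cancels for every omega, so reconstruction is exact.
```

With identical branch outputs the ω term vanishes, which is exactly why NPI reconstructs well at any guidance scale.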

NPI

 

ProxEdit (real image editing)

ProxEdit: Improving Tuning-Free Real Image Editing with Proximal Guidance

improved NPI

 

StyleDiffusion (real image editing)

StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing

StyleDiffusion-2

  1. Similar to NTI: first run DDIM Inversion with ω=1 (requires the source prompt) and record all ẑ_t. Initialize z̃_T = ẑ_T; for t = T down to 1, use z̃_t and ω=7.5 to generate z_{t−1}, pulling it toward ẑ_{t−1}. In this process the CLIP encoding of the source prompt serves as K, while the image's CLIP encoding, passed through a trainable network M_t, serves as V; only M_t is optimized. Once optimized, use z̃_t and M_t with ω=7.5 to generate z̃_{t−1} and move on. This guarantees reconstruction at ω=7.5 with the help of M_t.

  2. Besides the MSE loss between z_{t−1} and ẑ_{t−1}, an MSE loss between the cross-attention maps is also used.

  3. For editing, start from z̃_T = ẑ_T and generate with the target prompt, the trained M_t, and ω=7.5.

 

DirectInv (real image editing)

Direct Inversion: Boosting Diffusion-based Editing with 3 Lines of Code

DirectInv-1

DirectInv-2

  1. Every step of P2P's reconstruction generative trajectory is corrected so that its latent matches the corresponding latent recorded during DDIM Inversion, guaranteeing that the reconstruction trajectory reproduces the original image.

  2. Training-free; no optimization is required.

 

InfEdit (real image editing)

Inversion-Free Image Editing with Natural Language

With DDIM's σ_t chosen as σ_t = √(1−ᾱ_{t−1}), the DDIM sampling process coincides with Consistency Model multistep sampling; this is called the Denoising Diffusion Consistent Model (DDCM).

Exploiting this, editing can proceed without DDIM Inversion of the original image.

 

AdapEdit (real image editing)

AdapEdit: Spatio-Temporal Guided Adaptive Editing Algorithm for Text-Based Continuity-Sensitive Image Editing

Soft editing based on P2P.

 

FPE (real image editing)

Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

FPE-1

FPE-2

  1. P2P replaces cross-attention maps, but that requires finding a prompt for the real image, which is feasible yet works poorly. This paper finds that replacing self-attention maps also works.

  2. For real-image editing, neither DDIM Inversion nor reconstruction needs a prompt.

 

DDS (real image editing)

Delta Denoising Score

DDS-1 DDS-2

 

  1. Treating the image itself as parameters, SDS can be used for editing (input the target prompt and update the image by gradient), but this blurs the image, as in the top half of the figure.

  2. The cause is a bias term inside the SDS loss, so the SDS loss is split into two parts: one performing the edit and one blurring the image. The proposed DDS loss is ∇_θ L_DDS = (ϵ_ϕ(z_t, t, y) − ϵ_ϕ(ẑ_t, t, ŷ)) ∂z/∂θ, where θ is the image being edited and z is θ (the SDS notation is kept for generality), ẑ is the original image, y is the target prompt, and ŷ is the original image's prompt, usually obtained by modifying the target prompt; the same t and ϵ are used for both terms. Clearly ∇_θ L_DDS = ∇_θ L_SDS(z, y) − ∇_θ L_SDS(ẑ, ŷ). For any paired image-text, updating the image with ∇_θ L_SDS also blurs it, showing that the paired image-text's ∇_θ L_SDS is nonzero (if it were zero it would not update the image), even though it should ideally be zero since the image and text are already paired and need no optimization; this nonzero ∇_θ L_SDS is the bias term that blurs the image. It is further observed that ∇_θ L_SDS norms computed between very similar image-text pairs are close, so ∇_θ L_SDS(ẑ, ŷ) can be taken as an approximation of the bias term in ∇_θ L_SDS(z, y); DDS is the objective with this bias removed.

  3. DDS requires backpropagation for every editing request, which is computationally expensive; DDS can further be used to train an editing model, as shown in the figure.
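The identity ∇_θ L_DDS = ∇_θ L_SDS(z, y) − ∇_θ L_SDS(ẑ, ŷ) is easy to verify when both SDS terms share the same timestep and sampled noise, since the shared noise cancels. A minimal sketch (function names are my own; inputs stand for noise predictions):

```python
import numpy as np

def sds_grad(eps_pred, eps_sampled):
    """Gradient direction of SDS (identity Jacobian): eps_theta - eps."""
    return eps_pred - eps_sampled

def dds_grad(eps_pred_edit, eps_pred_ref):
    """DDS: the shared sampled noise cancels, leaving the prediction gap
    between the edited branch (z_t, y) and the reference branch (z_hat_t, y_hat)."""
    return eps_pred_edit - eps_pred_ref
```

The reference branch's residual is precisely the bias term the paper identifies; subtracting it leaves only the editing direction.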

 

Ground-A-Score (real image editing)

Ground-A-Score: Scaling Up the Score Distillation for Multi-Attribute Editing

Ground-A-Score

  1. A DDS method for complex editing requirements.

  2. An MLLM decomposes the editing request and regions, yielding a source-prompt sequence {x_k}_{k=1}^n, a target-prompt sequence {y_k}_{k=1}^n, and a mask sequence {m_k}_{k=1}^n; the original image z is updated with region-wise DDS: ∇_z L_DDS = Σ_{k=1}^n m_k ⊙ (ϵ_ϕ(z_t, t, y_k) − ϵ_ϕ(ẑ_t, t, x_k)).

 

DreamSampler (real image editing)

DreamSampler: Unifying Diffusion Sampling and Score Distillation for Image Manipulation

DreamSampler

  1. Replaces DDS's randomly sampled timesteps and noise with score distillation performed along the DDIM sampling process, so DDS editing works without the original image's prompt.

  2. Specifically, in contrast to the original DDS method that adds newly sampled Gaussian noise to z0, DreamSampler adds the estimated noise by ϵθ in the previous timestep of reverse sampling. With initial noise computed by DDIM inversion, reverse sampling do not deviate significantly from the reconstruction trajectory even though source description is not given.

 

SmoothDiffusion (real image editing)

Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models

Ensures the smoothness of the diffusion model's latent space: "smooth latent spaces ensure that a perturbation on an input latent (x_T) corresponds to a steady change in the output image (DDIM sampled x_0)."

This is done by adding a Step-wise Variation Regularization term during training.

It benefits DDIM Inversion and reconstruction at ω=7.5, and thereby also benefits editing.

SmoothDiffusion

 

IterInv (real image editing)

IterInv: Iterative Inversion for Pixel-Level T2I Models

Inversion for models containing a super-resolution stage.

 

KV-Inversion (real image editing)

KV Inversion: KV Embeddings Learning for Text-Conditioned Real Image Action Editing

  1. "The contents (texture and identity) are mainly controlled in the self-attention layer," so the K and V embeddings in the self-attention layers are what gets learned.

  2. First run DDIM Inversion with ω=7.5 (uncertain) and record all z_t; then prepare LoRA parameters ψ_t for the self-attention KV projection matrices at each t. Initialize ẑ_T = z_T; for t = T down to 1, use ẑ_t and ψ_t with ω=7.5 to generate a candidate z_{t−1}, and optimize ψ_t to pull it toward the recorded z_{t−1}. Once optimized, use ẑ_t and ψ_t with ω=7.5 to generate ẑ_{t−1} and move on. This guarantees reconstruction at ω=7.5 with the help of ψ_t.

  3. For editing, start from z_T and generate with the target prompt, the trained {ψ_t}_{t=0}^T, and ω=7.5.

 

EDICT (real image editing, no fine-tune)

EDICT: Exact Diffusion Inversion via Coupled Transformations

  1. StableDiffusion

  2. Non-P2P mode: directly invert with the source prompt via DDIM Inversion, then generate with the target prompt, both with a large ω.

  3. Borrowing the Affine Coupling Layer idea from flow-based generative models, an invertible denoising process is designed, ensuring that noising followed by reconstruction recovers the original image even with a large ω.

 

AIDI (real image editing, no fine-tune)

Effective Real Image Editing with Accelerated Iterative Diffusion Inversion

AIDI

  1. The DDIM generation step x_{t−1} = a_t x_t + b_t ϵ(x_t, t) is not invertible; DDIM Inversion makes the approximation ϵ(x_t, t) ≈ ϵ(x_{t−1}, t−1). Without that approximation, one must solve x_t = (1/a_t)x_{t−1} − (b_t/a_t)ϵ(x_t, t) given x_{t−1}, so each DDIM-Inversion step becomes finding a fixed point of f(x_t) = (1/a_t)x_{t−1} − (b_t/a_t)ϵ(x_t, t), where x_{t−1} is known and treated as a constant; the fixed point is found with Anderson acceleration.

  2. (Unrelated to the proposed algorithm) The paper finds that asymmetric ω works better in P2P: DDIM Inversion with ω=0 (pure image inversion, no CFG) and a large ω for both the reconstruction and editing trajectories beats the original P2P using the same ω throughout, consistent with the conclusion in (Inversion-DDIM Inversion-4); likewise for EDICT, inverting with ω=0 and editing with a large ω beats the original EDICT with a single ω.

  3. The paper uses the P2P algorithm with the above fixed-point iteration during DDIM Inversion; both DDIM Inversion and the reconstruction trajectory use a small ω. "For the editing generative trajectory, we introduce a blended ω to apply larger guidance scales for pixels relevant to editing and lower ones for the rest to keep them unedited." A mask computed from the reconstruction trajectory's cross-attention maps decides which pixels are relevant to editing.
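The fixed-point view of one inversion step can be sketched with plain iteration (a simplification: AIDI uses Anderson acceleration, while this sketch uses naive fixed-point iteration seeded with the usual DDIM-inversion approximation; the toy `eps_fn`, `a`, `b` are assumptions):

```python
import numpy as np

def invert_step_fixed_point(x_prev, eps_fn, a, b, iters=50):
    """Solve x_t = (x_prev - b * eps_fn(x_t)) / a by fixed-point iteration.
    Start from the standard approximation eps(x_t) ≈ eps(x_prev)."""
    x_t = (x_prev - b * eps_fn(x_prev)) / a
    for _ in range(iters):
        x_t = (x_prev - b * eps_fn(x_t)) / a
    return x_t
```

Iteration converges when |b·ϵ′(x)/a| < 1, which holds for the small b_t of a fine-grained DDIM schedule; Anderson acceleration reaches the same fixed point in fewer iterations.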

 

SPDInv (real image editing, no fine-tune)

Source Prompt Disentangled Inversion for Boosting Image Editability with Diffusion Models

  1. Similar to AIDI, also a fixed-point method. The DDIM generation step x_{t−1} = a_t x_t + b_t ϵ(x_t, t) is not invertible; DDIM Inversion approximates ϵ(x_t, t) ≈ ϵ(x_{t−1}, t−1), giving x_t = (1/a_t)x_{t−1} − (b_t/a_t)ϵ(x_{t−1}, t−1). After obtaining this approximate x_t at each step, it is further refined by gradient descent on ‖f(x_t) − x_t‖² to find the fixed point.

  2. Applicable to many editing methods, e.g., P2P, MasaCtrl, PNP, ELITE.

 

FateZero (real image editing, no fine-tune)

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

  1. StableDiffusion

  2. P2P injects the cross-attention maps of the generative trajectory into the editing trajectory; this paper directly injects the attention maps recorded during DDIM Inversion into the editing trajectory, eliminating the generative (reconstruction) trajectory altogether. Reconstruction quality remains good.

  3. Both use a large ω. During DDIM Inversion, the self-attention and cross-attention maps of all timesteps are recorded; during editing, as in P2P, the cross-attention maps of the unchanged parts of the prompt are replaced with those from DDIM Inversion, and all self-attention maps are replaced as well ("to preserve the original structure and motion during the style and attribute editing").

FateZero

 

MasaCtrl (real image editing, no fine-tune)

MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing

Tuning-Free Inversion-Enhanced Control for Consistent Image Editing

  1. StableDiffusion

  2. First invert the original image with the source prompt via DDIM Inversion; from the resulting x_T, with ω=7.5, run reconstruction and editing in parallel as in P2P using the source and target prompts. At each step, in the last several layers of the editing trajectory's UNet decoder, the self-attention KV are replaced with the KV from the same position in the reconstruction trajectory (Q remains the editing trajectory's own).

  3. Only the self-attention of the last layers of the UNet decoder is modified: "the Query features in the shallow layers of U-Net (e.g., encoder part) cannot obtain clear layout and structure corresponding to the modified prompt."

  4. The operation is applied only in the middle steps: "performing self-attention control in the early steps can disrupt the layout formation of the target image. In the premature step, the target image layout has not yet been formed."

  5. Meanwhile, at every step both trajectories threshold their cross-attention maps to compute an object mask, restricting the editing trajectory's object-region self-attention to reference only the reconstruction trajectory's object region.

MasaCtrl

  1. Whereas P2P manipulates only cross-attention, MasaCtrl manipulates only self-attention: cross-attention manipulation suits adding/removing objects, while self-attention manipulation suits changing actions.

 

MRGD (real image editing, no fine-tune)

Multi-Region Text-Driven Manipulation of Diffusion Imagery

  1. A MultiDiffusion version of P2P that edits different regions.

 

Object Variations (generated image editing, no fine-tune)

Localizing Object-level Shape Variations with Text-to-Image Diffusion Models

  1. StableDiffusion

  2. Transform one object in an image while leaving the rest unchanged, e.g., turning a basket into a plate. Two parallel generative trajectories are run; within a certain time interval, a word in the sentence is swapped.

Prompt-Mixing

  1. shape preservation: threshold the cross-attention map to localize the object of a word that needs shape preservation; then, in the preceding self-attention maps, inject the rows and columns corresponding to all of that object's pixels into the new generative trajectory. Alternatively, localize the object being edited, treat the pixels outside it as background, and apply shape preservation to those pixels.

Object-Variations

  1. With Null-text Inversion, real-image editing is also possible.

 

IP2P (real image editing, retrain)

InstructPix2Pix: Learning to Follow Image Editing Instructions

IP2P

  1. Using GPT-3, StableDiffusion, and P2P (generated-image editing), a dataset is built in which each sample contains an original image, its caption, a target caption, and a target image. A new StableDiffusion is trained to model the target image conditioned on the original image and the target caption, so no source caption is needed at inference.

 

Emu Edit (real image editing, retrain)

Emu Edit: Precise Image Editing via Recognition and Generation Tasks

  1. Similar to IP2P: a new dataset is created for training.

  2. As with Emu, after training the model is fine-tuned on a small amount of high-quality data.

 

RP2P (real image editing, retrain)

ReasonPix2Pix: Instruction Reasoning Dataset for Advanced Image Editing

  1. "We introduce ReasonPix2Pix, a dataset specifically tailored for instruction-based image editing with a focus on reasoning capabilities." When building the dataset, instructions requiring inference are generated, e.g., "the owner of the castle is a vampire" instead of "make the castles dark".

  2. The original image and the instruction are fed into an MLLM; its output features and the original image serve as conditions for fine-tuning StableDiffusion.

 

PbI (real image editing, retrain)

Paint by Inpaint: Learning to Add Image Objects by Removing Them First

PbI

  1. Unlike IP2P, which builds its dataset with P2P, PbI builds its dataset with the idea of PbE.

  2. The editing model is the same as IP2P.

 

EditWorld (real image editing, retrain)

EditWorld: Simulating World Dynamics for Instruction-Following Image Editing

  1. GPT generates an input text, an instruction, and an output text. SDXL generates I_ori from the input text; thresholding the cross-attention maps during generation yields a mask per text token, whose union forms I_ori's foreground mask. The foreground of I_ori is then inpainted according to the output text to obtain I_tar, using IP-Adapter and ControlNet during inpainting to keep I_tar consistent with I_ori. I_ori, I_tar, and the instruction form one training sample.

  2. The editing model is the same as IP2P.

 

EmoEdit (real image editing, retrain)

EmoEdit: Evoking Emotions through Image Manipulation

EmoEdit

  1. An instruction is generated from the emotion, and a pretrained IP2P performs the edit.

 

LIME (real image editing, retrain)

LIME: Localized Image Editing via Attention Regularization in Diffusion Models

  1. Uses a pretrained InstructPix2Pix.

  2. Extract the original image's UNet features; resize, concat, normalize, and cluster them to obtain a segmentation.

  3. Extract the cross-attention maps of the related tokens in the target caption, find the few highest-response points, and stitch together the segments containing those points as the RoI.

  4. During IP2P generation, perform blended editing while using the RoI to modify the cross-attention maps: for unrelated tokens' maps, subtract a large constant inside the RoI so that unrelated tokens cannot affect the edit.

 

FoI (real image editing, retrain)

Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation

  1. Uses a pretrained InstructPix2Pix.

  2. Extract a keyword from the instruction; take its cross-attention map, repeatedly apply squaring + normalization to widen the gap between high and low responses, and threshold to estimate a mask.

  3. For the cross-attention maps of all instruction tokens, amplify responses inside the mask and replace responses outside the mask with those of the cross-attention map of ϵ_θ(z_t, t, I, ϕ) (i.e., the null instruction).

  4. During sampling, multiply s_T(ϵ_θ(z_t, t, I, T) − ϵ_θ(z_t, t, I, ϕ)) by the mask.

 

WYS (real image editing, retrain)

Watch Your Steps: Local Image and Scene Editing by Text Instructions

  1. Uses a pretrained InstructPix2Pix.

  2. Similar to DiffEdit: a mask is computed before editing, and blended editing is performed during InstructPix2Pix generation.

 

ZONE (real image editing, retrain)

ZONE: Zero-Shot Instruction-Guided Local Editing

  1. Uses a pretrained InstructPix2Pix.

  2. The cross-attention maps of a description-guided model like StableDiffusion are token-wise, while those of an instruction-guided model like InstructPix2Pix are consistent. So a mask is estimated by thresholding InstructPix2Pix's cross-attention maps. Since this mask is too coarse, InstructPix2Pix's edited result is fed into SAM, and the segment with the largest IoU overlap is chosen as the mask. Given the mask, the region outside it in the edited result is replaced with the original image, and smoothing operations remove artifacts.

ZONE

 

VisII (real image editing, retrain)

Visual Instruction Inversion: Image Editing via Visual Prompting

  1. Textual Inversion of visual instructions on top of IP2P.

  2. IP2P takes an original image and an instruction and outputs the edited image. Given one example pair of an original and its edited image, TI is used on IP2P to learn an instruction embedding, which can then be applied to other images to achieve edits similar to the example.

 

E4C (real image editing, fine-tune)

E4C: Enhance Editability for Text-Based Image Editing by Harnessing Efficient CLIP Guidance

E4C

  1. DDIM Inversion uses ω=0 (pure image inversion, no CFG), in a two-branch mode: as in DirectInv, at each step the reconstruction trajectory feeds the UNet not the previously generated latent but the z_t recorded during DDIM Inversion, supplying KV or Q to the editing trajectory.

  2. "Queries for structure and layout, whereas keys and values for textures and appearance." For layout-preserving edits, Q is replaced and the optimization below is unnecessary; for layout-changing edits, KV are replaced and the optimization below is needed.

  3. Similar to DiffusionCLIP, two losses optimize Q's projection matrix: a CLIP direction loss L_CLIP and an MSE loss L_Reg between the final z_0 of the two trajectories.

 

Imagic (real image editing, fine-tune)

Imagic: Text-Based Real Image Editing with Diffusion Models

  1. Imagen

  2. Only the original image and a target prompt are given.

  3. Starting from the target prompt embedding, a source prompt embedding is optimized with TI; the source prompt embedding is then fixed and Imagen is fine-tuned; finally, generation uses a linear interpolation of the source and target prompt embeddings.

  4. Without fine-tuning Imagen, image preservation fails (similar to DragDiffusion), so fine-tuning is essential.

 

Forgedit (real image editing, fine-tune)

Forgedit: Text Guided Image Editing via Learning and Forgetting

  1. Same setting as Imagic, with slight differences in approach.

  2. vision-language joint learning: BLIP generates a source prompt for the original image; CLIP encodes it into a source prompt embedding; the embedding and the original image then jointly fine-tune Imagen, with the embedding also being optimized. Only part of the parameters are updated during fine-tuning, and it is observed that "The encoder of UNets learns the pose, angle and overall layout of the image. The decoder learns the appearance and textures instead." Parameters can therefore be forgotten: "If the target prompt tends to edit the pose and layout, we choose to forget parameters of encoder. If the target prompt aims to edit the appearance, the parameters of decoder should be forgotten."

  3. At generation, the component of the target prompt embedding orthogonal to the optimized source prompt embedding serves as the editing embedding, and generation uses a linear combination of the optimized source prompt embedding and the editing embedding, in order to preserve the original image's details.

 

DBEST (real image editing, fine-tune)

On Manipulating Scene Text in the Wild with Diffusion Models

DBEST

The order is the reverse of Imagic's, because the source prompt is provided here.

First fine-tune the diffusion model, then optimize the target prompt embedding with the cross-entropy loss of a pretrained text-recognition model.

 

PNP (real image editing, no fine-tune)

Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation

  1. StableDiffusion

  2. Unconditional DDIM Inversion (feeding ϕ) encodes the original image to noise; from this common starting point, two parallel generative trajectories are run, one with ϕ and one with the target prompt. At each step, feature injection and self-attention-map injection are applied to the editing trajectory. Very similar in spirit to pix2pix-zero.

  3. feature injection: reaching the same conclusion as MasaCtrl, deeper UNet features carry better structure information. The editing trajectory's deeper UNet feature maps are replaced with the reconstruction trajectory's. This preserves the original structure well, but some texture information leaks into the generated image.

  4. self-attention map injection: the editing trajectory's self-attention maps (Softmax(QK^T)) are replaced with the reconstruction trajectory's, keeping texture information consistent.

 

Self-Guidance (real image editing, no fine-tune)

Diffusion Self-Guidance for Controllable Image Generation

A loss computed from the cross-attention maps provides gradients as guidance, enabling edits such as moving objects, resizing them, or changing their appearance.

 

Asymmetric Gradient Guidance (real image editing)

Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance

A guidance method combining MCG and DDS, steering sampling with an arbitrary loss.

 

Asyrp (real image editing)

Diffusion Models Already Have a Semantic Latent Space

  1. λ_CLIP·L_direction(P_t^edit, y_target; P_t^source, y_source) + λ_recon·|P_t^edit − P_t^source|, optimizing only f_t(h) = Δh.

  2. During training, dataset images are first inverted to T with S_for steps of DDIM Inversion and the resulting latents are saved (and can be reused); then two trajectories are generated from the latents, both for S_edit steps but only down to t_edit rather than 0. One is the original trajectory, the other is continually shifted, and at every step the loss above performs one optimization, similar to DiffusionCLIP's GPU-efficient approach.

 

Interpretable h-space (real image editing)

Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models

  1. Still operating in h-space, but sampling is no longer asymmetric: the original DDIM formula is used and h is modified as it passes through the UNet, which implies the proof in Asyrp is incorrect; moreover, each generation step now needs only one UNet pass, which is more efficient.

  2. unsupervised global: generate some samples and save h_t for all timesteps; for each timestep t, run PCA on all samples' h_t to obtain n principal components {v_t^j}_{j=0}^n. Generating with ĥ_t = h_t + γ·v_t^j at each step edits any image, and the j-th principal component carries the same semantics across timesteps.

  3. unsupervised image-specific: for instance, an eyes-open/closed editing direction is meaningless for a face wearing sunglasses. A differential-geometry-style method in h-space finds the direction that maximally changes the output of ϵ_θ(x_t). Although image-specific, a direction found on one image can also be applied to other samples.

  4. supervised: use annotated data pairs in which the positive example has an attribute and the negative lacks it; the average of the differences between positives' and negatives' h_t gives the editing direction. This suffers from entanglement, since positives and negatives never differ in exactly one attribute. Using a procedure akin to constructing an orthogonal basis, each newly computed editing direction removes the influence of all previously found directions, yielding disentangled editing directions.

 

ChatFace (real image editing)

ChatFace: Chat-Guided Real Face Editing via Diffusion Latent Space Manipulation

Using a CLIP directional loss, train a network that predicts $\Delta z$ for Diff-AE's latent $z$.

 

ZIP (real image editing)

Zero-Shot Inversion Process for Image Attribute Editing with Diffusion Models

ZIP

 

Self-Discovering (real image editing)

Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation

self-discovering

 

GANTASTIC (real image editing)

GANTASTIC: GAN-based Transfer of Interpretable Directions for Disentangled Image Editing in Text-to-Image Diffusion Models

GANTASTIC

  1. Transfers the interpretable directions that StyleGAN has already learned to StableDiffusion by learning a CLIP text embedding $d$ with two losses.

  2. $\mathcal{L}_{\mathrm{latent}} = \mathbb{E}_{t,\epsilon}\left[\lVert \epsilon_\theta(x_t, t, d) - \epsilon_\theta(x'_t, t, d) \rVert_2^2\right]$ over pairs $(x, x')$ differing along the GAN direction: maximize the gap between the two predictions under $d$, i.e., make $d$ learn where the pair differs most, in the same spirit as SDS.

  3. $\mathcal{L}_{\mathrm{sem}} = 1 - \mathrm{cossim}(E_I(x'), d) + \mathrm{cossim}(E_I(x), d)$ ensures the semantics.

 

NoiseCLR (real image editing)

NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models

Identifies interpretable directions in the text-embedding space of text-to-image diffusion models.

In the noisy space, edits carried out by the same direction should be attracted towards each other, while edits conducted by different directions should repel one another, in line with the core principles of contrastive learning.

NoiseCLR

 

Style Disentanglement (real image editing, no fine-tune)

Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models

  1. StableDiffusion

  2. $c^{(0)}$ is the description without the style, $c^{(1)}$ the description with it. Learn a schedule $\lambda_t$ such that, starting from the same $x_T$, the image $x_0^{\lambda}$ generated with $c_t = \lambda_t c^{(0)} + (1-\lambda_t) c^{(1)}$ keeps roughly the same content as the image $x_0^{(0)}$ generated with $c^{(0)}$ while carrying the style of $c^{(1)}$: $\mathcal{L}_{\mathrm{clip}}(x_0^{(0)}, c^{(0)}; x_0^{\lambda}, c^{(1)}) + \beta\,\mathcal{L}_{\mathrm{perc}}(x_0^{(0)}, x_0^{\lambda})$. The optimization is similar to DiffusionCLIP's, but only $\lambda_t$ is optimized; StableDiffusion is not fine-tuned.

  3. Once trained, it can also be used for image editing, but only on real images matching the description $c^{(0)}$: noise the image with DDIM Inversion to obtain $x_T$, then generate conditioned on $c_t$.

 

SINE (real image editing, fine-tune)

SINE: SINgle Image Editing with Text-to-Image Diffusion Models

  1. StableDiffusion

  2. Similar to DreamBooth: fine-tune the pseudo-word embedding and StableDiffusion with the source image and a prompt containing the pseudo word; every image to be edited requires its own fine-tuning run.

  3. Proposes Patch-Based Fine-Tuning. Suppose the StableDiffusion LDM operates at $p \times p$ and the autoencoder input is $sp \times sp$, with $s$ being 4 or 8. During fine-tuning, randomly sample a patch of the source image, resize it to $sp \times sp$, and feed the patch's positional encoding into StableDiffusion; this improves generalization and also lets the model output images of arbitrary size. When editing, the source image's positional encoding is used.

  4. Editing uses model-based classifier-free guidance, treating the fine-tuned model as an unconditional model specialized in generating this single image: $\omega\,\epsilon_\theta(z_t, c) + (1-\omega)\,\epsilon_\theta(z_t) = \omega\left[v\,\epsilon_\theta(z_t, c) + (1-v)\,\hat{\epsilon}_\theta(z_t, \hat{c})\right] + (1-\omega)\,\epsilon_\theta(z_t)$, where $\hat{\epsilon}_\theta$ is the fine-tuned model, $\hat{c}$ is "a photo/painting of a [*] [class noun]", and $c$ is the target prompt.

  5. No DDIM Inversion is needed.
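The model-based classifier-free guidance in item 4 is a plain linear combination of three noise predictions. A numpy sketch with dummy predictions (the $\omega$ and $v$ values are illustrative):

```python
import numpy as np

def sine_guidance(eps_cond, eps_hat_cond, eps_uncond, w=7.5, v=0.7):
    # omega * [v * eps_theta(z_t, c) + (1 - v) * eps_hat_theta(z_t, c_hat)]
    #   + (1 - omega) * eps_theta(z_t)
    return w * (v * eps_cond + (1.0 - v) * eps_hat_cond) + (1.0 - w) * eps_uncond

rng = np.random.default_rng(0)
# Dummy stand-ins for the three network outputs at one denoising step.
eps_c, eps_hat, eps_u = rng.normal(size=(3, 4, 4))
eps = sine_guidance(eps_c, eps_hat, eps_u)
```

Setting `v=1.0` drops the fine-tuned model's term and recovers vanilla CFG.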

 

SEGA (generated image editing, no fine-tune)

SEGA: Instructing Diffusion using Semantic Dimensions

A linear combination of CFG terms.

 

DiffEdit (real image editing, no fine-tune)

DiffEdit: Diffusion-based Semantic Image Editing with Mask Guidance

  1. Blended Diffusion with an automatically computed mask.

  2. For a text-to-image model, feed in the source prompt and $\varnothing$ separately and estimate a mask from the difference of their denoising outputs; encode the source image to some intermediate step with unconditional DDIM Inversion (input $\varnothing$); then generate with the target prompt, blending with the mask at every step.

  3. Proves theoretically that noising with unconditional DDIM Inversion reconstructs better than noising directly in one step with the SDE.
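The mask estimation in item 2 boils down to thresholding the normalized difference between the two noise predictions. A toy numpy sketch (the threshold is illustrative; the paper averages the difference over several noise samples and smooths it):

```python
import numpy as np

def estimate_mask(eps_src, eps_uncond, threshold=0.5):
    # Absolute difference of the two denoising predictions, averaged over channels,
    # normalized to [0, 1], then thresholded into a binary edit mask.
    diff = np.abs(eps_src - eps_uncond).mean(axis=0)
    diff = (diff - diff.min()) / (diff.max() - diff.min() + 1e-8)
    return (diff > threshold).astype(np.float32)

rng = np.random.default_rng(0)
eps_u = rng.normal(size=(4, 8, 8))      # prediction under the empty prompt
eps_s = eps_u.copy()
eps_s[:, 2:5, 2:5] += 3.0               # predictions differ only where the edit applies
mask = estimate_mask(eps_s, eps_u)
```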

 

DM-Align (real image editing, no fine-tune)

DM-Align: Leveraging the Power of Natural Language Instructions to Make Changes to Images

DM-Align

  1. Automatically computes the mask and casts editing as an inpainting problem.

 

FISEdit (real image editing, no fine-tune)

FISEdit: Accelerating Text-to-image Editing via Cache-enabled Sparse Diffusion Inference

Computes a mask automatically, in the spirit of DiffEdit: manipulate the cross-attention maps with the P2P method and compute a difference mask from the feature maps output by the two generative trajectories, marking the region to be edited.

 

InstDiffEdit (real image editing, no fine-tune)

Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks

Computes a mask automatically, in the spirit of DiffEdit: the cross-attention map of the target prompt's start token carries global semantic information, so compute the similarity between each remaining token's cross-attention map and it, take the most similar token's map, and post-process it into a mask.

 

Diff-AE & PDAE

Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

Unsupervised Representation Learning from Pre-trained Probabilistic Diffusion Models

Train an autoencoder, fit a linear classifier in its latent space, and use the normal vector of the attribute hyperplane as the editing direction.
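The hyperplane trick can be sketched with a least-squares linear classifier on synthetic latents; the weight vector is the hyperplane normal along which latents are shifted. (The real pipeline uses Diff-AE/PDAE latents and a proper logistic classifier; everything below is a stand-in.)

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
true_dir = np.zeros(dim)
true_dir[0] = 1.0                                 # ground-truth attribute direction
z = rng.normal(size=(200, dim))                   # stand-in semantic latents
labels = (z @ true_dir > 0).astype(float)         # attribute = sign along true_dir

# Least-squares "classifier": its weight vector approximates the
# normal of the attribute hyperplane.
w, *_ = np.linalg.lstsq(z, labels * 2 - 1, rcond=None)
n = w / np.linalg.norm(w)

z_edit = z[0] + 1.5 * n                           # move a latent across the hyperplane
```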

 

DisControlFace (real image editing)

DisControlFace: Disentangled Control for Personalized Facial Image Editing

Uses a pre-trained Diff-AE and additionally trains a ControlNet to introduce control signals. Training this way has a problem: the pre-trained Diff-AE backbone already allows near-exact image reconstruction, so only limited gradients flow during backpropagation, far from sufficient to train the ControlNet effectively. The idea of masked autoencoding is therefore introduced: a masked $x_0$ is used as the Diff-AE input during training, which amounts to training the ControlNet to inpaint.

At sampling time, first estimate the source image's control signals, which can then be edited before generation; meanwhile the source image is masked with a dynamic mask strategy of $0.75 - 0.5 \times (T-t)/T$, so the input $z$ differs at every step.

 

UFIE

User-friendly Image Editing with Minimal Text Input: Leveraging Captioning and Injection Techniques

Conventional editing methods such as P2P require the user to supply both a source and a target prompt; this paper uses an off-the-shelf captioning model to generate the source prompt, so the user only needs to point out which concepts in it to modify.

 

HIVE

HIVE: Harnessing Human Feedback for Instructional Visual Editing

  1. Train a StableDiffusion that denoises the target image conditioned on the source image and the target prompt.

  2. Introduce human feedback: fine-tune the above StableDiffusion with a learned reward function.

 

DialogPaint

DialogPaint: A Dialog-based Image Editing Model

  1. StableDiffusion

  2. multi-turn editing

 

EMILIE

Iterative Multi-granular Image Editing using Diffusion Models

  1. StableDiffusion

  2. Multi-turn editing, performed over StableDiffusion's latent space.

 

MGIE

Guiding Instruction-based Image Editing via Multimodal Large Language Models

Uses the InstructPix2Pix dataset; a multimodal large language model (MLLM) generates an editing command from the source image and the instruction, and a diffusion model is trained to generate the target image conditioned on the source image and the editing command.

 

DVP

Image Translation as Diffusion Visual Programmers

DVP

  1. CFG strength is very sensitive: a small change yields a very different image, and tuning the strength per image is impractical. Inspired by instance normalization in style transfer, Instance Normalization Guidance is proposed: $\epsilon = \sigma(\epsilon_u)\,\mathrm{conv}\!\left(\frac{\epsilon_u - \mu(\epsilon_c)}{\sigma(\epsilon_c)}\right) + \mu(\epsilon_u)$, where $\mathrm{conv}(\cdot)$ is a $1 \times 1$ convolution. The main goal is to reduce the influence of $\epsilon_u$, whose degrees of freedom are too high.
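The guidance is an AdaIN-style renormalization of the noise predictions. A numpy sketch following the formula above, with the $1 \times 1$ conv replaced by identity (a simplification of the method as described in these notes):

```python
import numpy as np

def in_guidance(eps_u, eps_c, conv=lambda x: x):
    # eps = sigma(eps_u) * conv((eps_u - mu(eps_c)) / sigma(eps_c)) + mu(eps_u)
    # The 1x1 conv is an identity here; the paper uses an actual convolution.
    mu_u, sd_u = eps_u.mean(), eps_u.std()
    mu_c, sd_c = eps_c.mean(), eps_c.std()
    return sd_u * conv((eps_u - mu_c) / sd_c) + mu_u

rng = np.random.default_rng(0)
eps_u = rng.normal(loc=0.5, scale=2.0, size=(8, 8))   # unconditional prediction
eps_c = rng.normal(loc=0.0, scale=1.0, size=(8, 8))   # conditional prediction
eps = in_guidance(eps_u, eps_c)
```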

 

Image Editing through Reference Image

This can be seen as image-guided inpainting; see text-guided inpainting in the Inpainting section, with the condition simply changed from text to image.

 

ILVR

ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models

Iterative Latent Variable Refinement: match the low-pass-filtered features of the noisy latents to those of the reference image.

Note that ILVR starts generation directly from $x_T \sim \mathcal{N}(0, I)$.

ILVR

 

PbE

Paint by Example: Exemplar-based Image Editing with Diffusion Models

PbE

  1. StableDiffusion

  2. Inputs: source image, mask, reference image; output: the source image with the masked region replaced by, and blended with, the reference image. The overall architecture resembles text-guided image inpainting: treat the reference image as the text, feeding it as the condition to StableDiffusion, concatenate the masked image with $z_t$ as the input, and retrain StableDiffusion with the full-image diffusion loss.

  3. Self-supervised learning: train on an image dataset with bounding boxes, using the region inside the box as the mask and its content as the reference image. Trained this way, the model easily overfits to a trivial copy-paste, so two remedies are proposed. Information Bottleneck: since the reference must be transplanted into the masked region, the model tends to memorize spatial information rather than understand context, so the reference image is compressed to make reconstruction harder — it is cropped and encoded with the CLIP image encoder, and the result serves as the KV for cross-attention in StableDiffusion. Strong Augmentation: the self-built dataset has a train-test domain gap, since training references are crops of the source image while test references are essentially unrelated, so the training references are augmented (flips, rotations, blur, etc.); and since bounding boxes hug the objects tightly, which hurts generalization, the mask region is also augmented — fit a Bezier curve to the bounding box, sample 20 points uniformly along the curve, and randomly extend each by 1 to 5 pixels.

  4. Blended sampling as in inpainting.

  5. Classifier-free guidance: with 20% probability the CLIP image-encoder output is replaced by a trainable vector; at sampling time the guidance scale controls the degree of fusion.

 

PbS

Reference-based Image Composition with Sketch via Structure-aware Diffusion Model

Builds on PbE, additionally requiring a sketch of the masked region as a condition (via concatenation) for further controllability.

PbS

 

IMPRINT

IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation

IMPRINT

  1. Train an image encoder (a DINOv2 backbone plus a small adapter, both trainable) on a multi-view dataset: encode one view of the object into an embedding sequence, feed it to StableDiffusion, and reconstruct another view. The image encoder and the StableDiffusion decoder are trained.

  2. Freeze the encoder backbone and retrain a diffusion model with self-supervised training; the encoder's adapter is also trained.

 

DreamInpainter

DreamInpainter: Text-Guided Subject-Driven Image Inpainting with Diffusion Models

Two conditions, a reference image and a text: the reference can be placed into the mask while the text adds control, e.g., over the pose.

Earlier methods encode the reference image with CLIP, losing detail. Here the reference is encoded with a pre-trained diffusion UNet encoder at timestep 0, taking the $32 \times 32 \times 768$ feature map. Using this feature map directly as the condition causes copy-paste overfitting, so it is viewed as $1024 \times 768$ vectors: compute the pairwise cosine similarities ($1024 \times 1024$ matrix), sum along either dimension so that pixels with low similarity to the others score low, and keep the $K$ lowest-scoring pixels, giving a $K \times 768$ condition.
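The token-selection step can be reproduced with a cosine-similarity score. A deterministic numpy sketch on a tiny handcrafted example (the dimensions and $K$ are illustrative):

```python
import numpy as np

def select_distinct_tokens(feats, k):
    # feats: (N, D) feature vectors; keep the K tokens least similar to the rest.
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                      # (N, N) cosine-similarity matrix
    scores = sim.sum(axis=1)           # low score = dissimilar to the other tokens
    idx = np.argsort(scores)[:k]
    return feats[idx], idx

# Three redundant tokens and one distinctive token: the distinctive one is kept.
feats = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
kept, idx = select_distinct_tokens(feats, k=1)
```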

 

PhD

Paste, Inpaint and Harmonize via Denoising Subject-Driven Image Editing with Pre-Trained Diffusion Model

Remove the background from the exemplar and paste it directly onto the target region; the composite conditions a ControlNet for PbE-style self-supervised learning.

 

RefPaint

Reference-based Painterly Inpainting via Diffusion Crossing the Wild Reference Domain Gap

Adds a mask branch on top of Versatile Diffusion: the reference image (during training, the masked-out part) goes through the context flow and the masked image through the mask branch, for self-supervised inpainting training.

 

ObjectStitch

ObjectStitch: Generative Object Compositing

  1. Uses a pre-trained text2img diffusion model. Since the input is an object image rather than text, a module is needed to convert the object image into a text embedding — the content adaptor, similar to TI. Train it with a pre-trained CLIP on large-scale image-caption data: the content adaptor maps the CLIP image embedding into the text-embedding space, yielding a translated embedding pushed towards the CLIP text embedding. Afterwards, fine-tune the content adaptor with the pre-trained text2img diffusion model and the textual-inversion method.

  2. Freeze the content adaptor and fine-tune the pre-trained text2img diffusion model.

  3. Blended sampling as in inpainting; the diffusion model receives only the translated embedding.

 

AnyDoor

AnyDoor: Zero-shot Object-level Image Customization

AnyDoor-1

AnyDoor-2 AnyDoor-3
  1. Use DINOv2 to extract the object's ID tokens, taking both the global token ($1 \times 1536$) and the patch tokens ($256 \times 1536$), concatenating them, and mapping them with a linear layer to $257 \times 1024$ to replace the text embedding in cross-attention as the object's global feature. Since this feature loses the object's details, high-pass filtering extracts detail features, which are pasted at the target location in the scene image and fed to a Detail Extractor (ControlNet architecture); the two are complementary. The UNet decoder is fine-tuned jointly during training.

  2. Earlier image-based self-supervised training, even with augmentation, still lacks diversity, so video datasets are used to build data: randomly sample two frames of the same scene, extract the object from one frame as the reference, and use the other frame as the target.

 

LAR-Gen

Locate, Assign, Refine: Taming Customized Image Inpainting with Text-Subject Guidance

LAR-Gen

  1. Locate:StableInpainting

  2. Assign:IP-Adapter

  3. Stage 1: train the diffusion UNet with StableInpainting + IP-Adapter.

  4. Stage 2: copy the stage-1 diffusion UNet into a RefineNet; the features before the self-attention in the RefineNet UNet decoder are sent to the diffusion UNet and concatenated with the corresponding features for self-attention. Only the RefineNet's image cross-attention is trained.

  5. Self-supervised learning: during training the subject image is carved out of the scene image, and LLaVA generates a caption of the subject image as the text.

  6. Blended sampling.

 

PAIR-Diffusion

PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models

Extract the image's segmentation map with a pre-trained model as its structure feature; encode the image with a pre-trained image encoder, take a shallow feature map, and spatially pool the features inside each segment's region as that segment's appearance feature. Train a diffusion model conditioned on both.

Structure editing: edit the segmentation map (e.g., change an object's shape or remove an object).

Appearance editing: given a reference image, replace a segment's appearance feature with that of the whole reference image or of one of its objects, then generate.

Note that editing needs no DDIM Inversion: simply generate from noise given the conditions. Since structure and appearance do not capture every feature of the image, unedited regions change slightly; the unedited segments can, however, be masked during editing, using blended sampling as in inpainting.

 

CustomNet

CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models

Use SAM to separate the object and the background in the source image, estimate the object's viewpoint, and use zero-1-to-3 to generate the object at a random novel viewpoint. Train a diffusion model conditioned on the novel-view object, the background, and the viewpoint to predict the source image.

At generation time, the object's orientation, its position in the image, and the background can all be specified.

 

Custom-Edit

Custom-Edit: Text-Guided Image Editing with Customized Diffusion Models

  1. Given an image and a few reference images, replace an object in the image with the concept from the reference images.

  2. Extract the concept in the reference images into a pseudo word with the Custom-Diffusion method.

  3. Real-image editing with Prompt2Prompt + Null-text Inversion, replacing the object's word in the prompt with the pseudo word.

 

DreamEdit

DreamEdit: Subject-driven Image Editing

Same as Custom-Edit but mask-based: after DreamBooth-style TI, run text-guided inpainting sampling (blended).

 

DreamCom

DreamCom: Finetuning Text-guided Inpainting Model for Image Composition

  1. Self-supervised learning: given 3~5 reference images, each with a bounding box (mask) marking the object, concatenate the mask and the masked image onto $z_t$, build a sentence with a rare token (e.g., "a sks cat") for TI, and fine-tune StableDiffusion at the same time.

  2. At generation time, given the background image and a bounding box (mask) for where the object should appear, generate with the same sentence.

 

SpecRef

SpecRef: A Fast Training-free Baseline of Specific Reference-Condition Real Image Editing

P2P with a reference image.

SpecRef-1

SpecRef-2

 

Try-On

TryOnDiffusion

TryOnDiffusion: A Tale of Two UNets

TryOnDiffusion

  1. Cascaded setup.

  2. The Parallel UNet addresses the poor performance of channel-wise concatenation by switching to a cross-attention mechanism; the green lines denote features fed as KV into the main UNet.

 

StableVITON

StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On

StableVITON

  1. Uses PbE's StableDiffusion, whose cross-attention receives the CLIP image embedding passed through an MLP.

  2. The CLIP image embedding loses much information, so a zero cross-attention block is inserted between decoder blocks to bring in details.

  3. In text cross-attention, a word's cross-attention map is roughly the silhouette of its object. The zero cross-attention block, however, performs image cross-attention: the cross-attention map of a garment image token in the query should be the correspondingly located image token in the key, not the whole garment region, so each map should concentrate on a single point. An additional attention total-variation loss is therefore used, designed to enforce that the center coordinates on the attention maps are uniformly distributed, alleviating interference among attention scores located at dispersed positions — i.e., making the cross-attention maps of different query image tokens as distinct as possible.

 

MMTryon

MMTryon: Multi-Modal Multi-Reference Control for High-Quality Fashion Generation

MMTryon

  1. Replace StableDiffusion's cross-attention with a Multi-Modal Attention block and its self-attention with a Multi-Reference Attention block.

 

TryOn-Adapter

TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On

TryOn-Adapter

 

PLTON

Product-Level Try-on: Characteristics-preserving Try-on with Realistic Clothes Shading and Wrinkles

PLTON

  1. Similar to StableVITON: uses PbE's StableDiffusion, whose cross-attention receives the CLIP image embedding passed through an MLP. The Dynamic Extractor encodes the image with the CLIP image encoder, but the subsequent MLP is trainable.

  2. The HF-Map is fed to a trainable ControlNet.

 

StableGarment

StableGarment: Garment-Centric Generation via Stable Diffusion

StableGarment

 

DTC

Diffuse to Choose: Enriching Image Conditioned Inpainting in Latent Diffusion Models for Virtual Try-All

DTC

  1. Paint by Example retrains the entire conditional StableDiffusion; here a ControlNet architecture is used instead.

 

IDM-VTON

Improving Diffusion Models for Authentic Virtual Try-on in the Wild

IDM-VTON

 

Wear-Any-Way

Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment

Wear-Any-Way

Exploits semantic correspondence: feed the image of a person wearing the garment and the garment image separately into the same StableDiffusion, extract features, and compute their similarities to obtain correspondences as supervision. At generation time, the way the garment is worn can then be specified, e.g., a lifted hem.

 

AnyFit

AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario

AnyFit

 

TPD

Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On

TPD

 

FLDM-VTON

FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on

FLDM-VTON

  1. Supervised with an additional off-the-shelf clothes-flattening network.

 

ShoeModel

ShoeModel: Learning to Wear on the User-specified Shoes via Diffusion Model

ShoeModel-1 ShoeModel-2
  1. Constructs data for self-supervised training.

 

Face

FaceStudio

FaceStudio: Put Your Face Everywhere in Seconds

Carve out the face and train with self-supervision.

 

HS-Diffusion

HS-Diffusion: Semantic-Mixing Diffusion for Head Swapping

Head swapping: a pre-trained model performs blended inpainting generation.

 

Stable-Makeup

Stable-Makeup: When Real-World Makeup Transfer Meets Diffusion Model

Stable-Makeup

Use ChatGPT to generate prompts for different makeup styles and LEDITS to edit makeup-free face images into made-up ones, yielding pairs for supervised training.

Similar to IP-Adapter: the CLIP global token plus patch tokens are fed into cross-attention.

 

MLLM

Handles multiple tasks, such as text2img generation, personalization, and editing.

 

BLIP-Diffusion

BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

BLIP-Diffusion

  1. Following BLIP, first pre-train a multimodal image encoder on large-scale image-text data that can extract text-aligned features from images.

  2. Given a subject image and subject text, the multimodal image encoder produces the subject image's features, and an MLP is trained to convert them into a text embedding. Training images and corresponding prompts are then constructed from the subject image (e.g., by replacing the background); the converted embedding is appended to the prompt, passed through the text encoder, and fed to StableDiffusion for training. The multimodal image encoder, MLP, text encoder, and StableDiffusion are trained jointly.

  3. Generation needs only a subject image, subject text, and prompt — no test-time fine-tuning.

 

UNIMO-G

UNIMO-G: Unified Image Generation through Multimodal Conditional Diffusion

UNIMO-G

  1. Self-supervised training: first obtain the image's caption with a captioning model, then use Grounding DINO and SAM to crop the images of the objects mentioned in the caption, and replace those object words with their images to build an interleaved dataset. A pre-trained MLLM encodes it, and the encodings (the last-hidden-layer outputs of all tokens) are fed to StableDiffusion to reconstruct the image; only StableDiffusion is trained.

  2. Since the MLLM input may contain image entities, to better preserve their details an extra cross-attention between $z_t$ and the image entities is added to StableDiffusion's cross-attention; as in TokenCompose, the cross-attention maps are supervised with segmentation maps (known, thanks to the self-supervised setup). This both improves training and allows specifying positions at inference.

 

Kosmos-G

Kosmos-G: Generating Images in Context with Multimodal Large Language Models

Kosmos-G

Kosmos-G-AlignerNet

  1. MLLM: extract image embeddings with CLIP, use an attentive pooling mechanism to reduce the number of image embeddings, and train the MLLM with next-token prediction on interleaved text-image data; CLIP's last layer is trained and the loss is computed only on text tokens, similar to Emu2's caption stage.

  2. AlignerNet: to use StableDiffusion directly for generation (without training it), train an AlignerNet that maps Kosmos-G's outputs into the CLIP text-embedding domain. Training uses text only: encode it with Kosmos-G (the last-hidden-layer outputs of all tokens) and with the CLIP text encoder, obtaining $s$ and $t$; train a Q-Former $M$ with an MSE loss between $M(s)$ and $t$. To prevent the reduction in feature discrimination, also train a Q-Former $N$ with an MSE loss between $N(M(s))$ and $s$; the two Q-Formers are trained jointly.

  3. The MLLM can also be aligned with Kosmos-G by directly using the diffusion loss with the help of AlignerNet, though this is more costly and performs worse under the same GPU days.

 

Emu2

Generative Multimodal Models are In-Context Learners

Emu2

  1. Caption: extract image embeddings with CLIP, use a mean pooling mechanism to reduce the number of image embeddings, and train CLIP (note: not the MLLM) with next-token prediction on interleaved text-image data, computing the loss only on text tokens. The goal of this stage is to obtain an image encoder.

  2. Caption + regression: freeze the image encoder and train the MLLM with next-token prediction on interleaved text-image data, with a classification loss on text tokens and a regression loss on image features.

  3. StableDiffusion: train StableDiffusion to decode the image encoder's outputs.

 

GILL

Generating images with multimodal language models

GILL-1

GILL-2

  1. Caption: as in LLaVA, train a projection layer with next-token prediction on image-caption data, with the loss only on text tokens.

  2. Producing images: add $r$ trainable image token embeddings to the LLM's embedding layer, feed in caption+image with the image uniformly replaced by these $r$ embeddings, and regress the $r$ embeddings; only they are trained. (I don't follow this part — the actual image never seems to enter the pipeline.)

  3. As in Kosmos-G, train a Q-Former to convert the $r$ image token embeddings into CLIP text embeddings, with an MSE loss against the caption's CLIP text embedding.

 

TIE

TIE: Revolutionizing Text-based Image Editing for Complex-Prompt Following and High-Fidelity Editing

TIE

  1. Construct CoT data to fine-tune an MLLM.

 

Image Editing through Point-based Supervision

Self-Guidance

Diffusion Self-Guidance for Controllable Image Generation

 

DragDiffusion

DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing

DragDiffusion-1

DragDiffusion-2 DragDiffusion-3

 

  1. StableDiffusion

  2. First LoRA-fine-tune StableDiffusion on the image to be edited, then DDIM-invert the image to some timestep $t$ to get $z_t$. Use the output feature map of the third UNet decoder layer for motion supervision and point tracking, repeatedly optimizing $z_t$ by gradient descent, and finally run DDIM denoising from the updated $\hat{z}_t$ to generate the edited sample.

 

DragNoise

Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation

DragNoise

  1. Same as DragDiffusion: LoRA-fine-tune StableDiffusion on the image to be edited, DDIM-invert the image to some timestep $t$ to get $z_t$, use the output feature map of the third UNet decoder layer for motion supervision and point tracking, repeatedly optimize $z_t$ by gradient descent, and finally run DDIM denoising from the updated $\hat{z}_t$.

  2. A forgetting issue is observed: subsequent denoising steps tend to overlook the manipulation effect when diffusion semantic optimization is performed at only one timestep. Since propagating the bottleneck feature to later timesteps does not significantly influence the overall semantics, the optimized bottleneck feature $\hat{s}_t$ is copied and substituted in the subsequent timesteps.

 

EasyDrag

EasyDrag: Efficient Point-based Manipulation on Diffusion Models

EasyDrag

  1. No LoRA fine-tuning: directly DDIM-invert the image to some timestep $t$ to get $z_t$, then use the output feature maps of the second and third UNet decoder layers, upsampled to $z_t$'s size and concatenated, for motion supervision.

  2. During motion supervision, EasyDrag always targets the features around the original points in the source image's $z_t$ feature map, whereas DragDiffusion targets the features around the dragged points in the feature map of the $z_t$ from the previous optimization round.

  3. Reference guidance replaces the self-attention KV with those of the $z_t$ from DDIM Inversion.

 

StableDrag

StableDrag: Stable Dragging for Point-based Image Editing

StableDrag-1

StableDrag-2

  1. For point tracking, besides the conventional training-free difference computation, a trainable track model is used: a trainable $1 \times 1$ convolution kernel. It is trained only on the local patch of the source image centered at the user-specified starting point, and once trained it is used throughout the subsequent motion supervision and point tracking. Convolving the kernel over the local patch yields a score map of the same size; the ground truth is a Gaussian-shaped score map centered at the starting point, and the MSE between the two optimizes the kernel.

  2. In long-range drags the image content inevitably changes a lot and the point feature changes with it, so keeping it consistent with the source starting-point feature is unreasonable: supervision should be high-quality and comprehensive at each step yet allow suitable modifications to accommodate novel content creation for the updated states. A confidence score is therefore computed from the point-tracking result: when it is high, the previous step's point feature supervises the latent optimization; when it is low, the source image's starting-point feature does.

 

FreeDrag

FreeDrag: Feature Dragging for Reliable Point-based Image Editing

  1. Feature dragging: previous methods' point dragging sums point-to-point losses over the features of a region; feature dragging instead aggregates the features in the region, $F_r(h_i^k) = \sum_{q_i \in \Omega(h_i^k, r)} F(q_i)$, and performs feature tracking with $\mathcal{L}_{\mathrm{drag}} = \sum_{i=1}^{n} \lVert F_r(h_i^k) - T_i^k \rVert_1$, where $T_i^{k+1} = \lambda_i^k F_r(h_i^k) + (1-\lambda_i^k) T_i^k$ is an adaptively updated template feature with $T_i^0 = F_r(p_i^0)$ and $\lambda_i^0 = 0$. $\lambda_i^k$ is set according to the value of $\mathcal{L}_{\mathrm{drag}}$ after optimization: if it remains large, $\lambda_i^k$ is set small to reduce the change in $T_i^{k+1}$; if it is small, $\lambda_i^k$ is set large to increase the change — consistent with StableDrag's idea.

  2. Line search with backtracking: $h_i^k$ is constrained to the line extending from $p_i^0$ to $t_i$.
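The adaptive template update in item 1 is an exponential moving average whose rate depends on how well the last optimization went. A toy numpy sketch (the mapping from $\mathcal{L}_{\mathrm{drag}}$ to $\lambda$ is illustrative):

```python
import numpy as np

def update_template(T_k, F_r, loss_drag, low=0.1, high=1.0, tol=0.5):
    # T^{k+1} = lambda * F_r(h^k) + (1 - lambda) * T^k
    # Good optimization (small L_drag) -> large lambda -> template moves more;
    # poor optimization (large L_drag) -> small lambda -> template barely moves.
    lam = high if loss_drag < tol else low
    return lam * F_r + (1.0 - lam) * T_k

T = np.zeros(4)                                      # current template T^k
F_good = np.ones(4)                                  # aggregated feature F_r(h^k)
T_next = update_template(T, F_good, loss_drag=0.1)   # converged step: adopt new feature
T_stuck = update_template(T, F_good, loss_drag=5.0)  # stuck step: keep old template
```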

 

DragonDiffusion

DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models

  1. StableDiffusion

  2. Inspired by DIFT: the model's output features have a correspondence property — features of corresponding regions of the same object are highly similar.

  3. Similar to P2P + self-guidance: two parallel generative trajectories, one for reconstruction and one for editing. A loss is computed from the output features of their second and third layers (self-guidance uses attention instead) — the similarity between the source and target regions' features — and its gradient serves as guidance.

  4. Replace the self-attention key-value in the editing trajectory's UNet decoder with those of the reconstruction trajectory.

 

DiffEditor

DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing

  1. An improved version of DragonDiffusion.

  2. First train an image prompt encoder on LAION: encode the image with the pre-trained CLIP image encoder into an embedding sequence of length 257, feed it as the cross-attention key-value into a Q-Former that outputs a sequence of length 64, which goes into StableDiffusion's cross-attention; only the Q-Former is trained. During editing, using the source image's image prompt on the editing generative trajectory works better.

  3. The authors find that using random initialization instead of the DDIM-inverted $z_T$ in DragonDiffusion improves the edit, but unrelated details change as well — again demonstrating the dilemma between consistency and editing flexibility. During editing they therefore inject a small amount of stochasticity, i.e., $\sigma_t > 0$, over a segment of the DDIM generation.

  4. Uses RePaint's resampling technique: after generating $z_{t-1}$ from $z_t$, noise it back to $z_t$ and repeat, avoiding disharmony in the final result caused by a single inaccurate step. Previous resampling used random noising, which introduces uncertainty; here DDIM inversion performs deterministic noising.

 

LucidDrag

Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner

LucidDrag

  1. The editing guidance is DragonDiffusion's guidance.

 

Pixel-wise Segmentation Guidance

Fine-grained Image Editing by Pixel-wise Guidance Using Diffusion Models

  1. Similar to SKG and Late-Constraint.

  2. For a segmentation dataset $\{x_0, y\}$, feed $x_t$ into the pre-trained diffusion model and train a semantic segmentation model on the UNet feature maps.

  3. Edit the image's segmentation map and compute a mask from the edit — a mask-based method.

  4. To edit, first DDIM-invert to an intermediate step, then generate; at each step, feed the UNet feature maps into the segmentation model to produce a segmentation map, compute the loss against the edited segmentation map, and take its gradient as guidance.

 

Readout-Guidance

Readout Guidance: Learning Control from Diffusion Features

 

SDE-Drag

The Blessing of Randomness: SDE beats ODE in General Diffusion-based Image Editing

  1. The method is exactly CycleDiffusion.

  2. A unified framework.

SDE-Drag-1

The first stage produces an intermediate latent variable $x_{t_0}$ through either a noise-adding process (SDEdit) or an inversion process (DiffEdit). The latent variable $x_{t_0}$ is then manipulated manually or transferred to a different data domain by changing the condition in a task-specific manner, resulting in $\hat{x}_{t_0}$.

The second stage starts from $\hat{x}_{t_0}$ and produces the edited image $\hat{x}_0$ following either an ODE solver, an SDE solver, or a Cycle-SDE process.

The additional noise in the SDE formulation (both the original SDE and Cycle-SDE) provides a way to reduce the gap caused by mismatched prior distributions (between $p(\hat{x}_{t_0})$ and $p(x_{t_0})$), while the gap remains invariant in the ODE formulation, suggesting the blessing of randomness in diffusion-based image editing.

The manipulated $\hat{x}_{t_0}$ deviates from the distribution of $x_{t_0}$; with randomness, this deviation shrinks over the course of generation, whereas without it (i.e., with an ODE) the deviation stays constant.

  1. Drag

SDE-Drag-2

When the target point is far from the source point, it is challenging to drag the content in a single operation, so the Drag-SDE process is divided into $m$ steps along the segment joining the two points, each moving an equal distance sequentially.

 

RotationDrag

RotationDrag: Point-based Image Editing with Rotated Diffusion Features

A point-based editing method for the rotation scenario.

 

Motion-Guidance

Motion Guidance: Diffusion-Based Image Editing with Differentiable Motion Estimators

  1. Use a differentiable off-the-shelf optical-flow estimator to compute the flow between the $\hat{x}_0$ predicted at each generation step and the source image; compute a loss against the user-specified flow and take its gradient as guidance.

  2. Estimate a mask from the user-specified flow and generate blended.

  3. Adopts RePaint's resampling technique.

 

Magic-Fixup

Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos

Magic-Fixup-1

Magic-Fixup-2

  1. Both the detail extractor and the synthesizer are initialized from StableDiffusion, with the cross-attention removed and the input blocks extended; both are trained.

  2. Effectively adds a cross-attention after each of the synthesizer's self-attentions: Q comes from itself, KV from the detail extractor's features before its self-attention.

  3. As in AnyDoor, training data is built from videos.

  4. Generation starts from $\sqrt{\bar{\alpha}_t}\, I_{\mathrm{coarse}} + \sqrt{1-\bar{\alpha}_t}\,\epsilon$; starting from a standard Gaussian performs worse.

 

Model Editing

TIME

TIME: Editing Implicit Assumptions in Text-to-Image Diffusion Models

When the prompt does not specify otherwise, the model generates with implicit assumptions — e.g., roses are red, doctors are male. This method edits (not removes) such implicit assumptions, e.g., changing "roses are red" to "roses are blue", so that whenever the model later sees a prompt mentioning roses it defaults to generating blue ones.

This is done by training new KV projection matrices for all cross-attentions, pushing the product of the new matrix with "rose" towards the product of the original matrix with "blue rose"; the new matrices thus map "rose" by default to the original model's projection of "blue rose".
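This kind of edit admits a ridge-regularized closed form: find $W'$ minimizing $\lVert W' c_{\mathrm{src}} - W c_{\mathrm{dst}} \rVert^2 + \lambda \lVert W' - W \rVert_F^2$. A numpy sketch of that closed-form update (a simplification of the TIME/UCE objective; dimensions and $\lambda$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_emb, d_out = 6, 4
W = rng.normal(size=(d_out, d_emb))   # original K/V projection matrix

c_src = rng.normal(size=(d_emb,))     # stand-in embedding of "rose"
c_dst = rng.normal(size=(d_emb,))     # stand-in embedding of "blue rose"
v_dst = W @ c_dst                     # target projection under the old matrix

# W' = argmin ||W' c_src - W c_dst||^2 + lam * ||W' - W||_F^2
# Setting the gradient to zero gives the closed form below.
lam = 0.1
A = np.outer(v_dst, c_src) + lam * W
B = np.outer(c_src, c_src) + lam * np.eye(d_emb)
W_new = A @ np.linalg.inv(B)
```

With small $\lambda$, `W_new @ c_src` lands close to `v_dst` while `W_new` stays close to `W` on unrelated directions.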

 

UCE

Unified Concept Editing in Diffusion Models

UCE

Similar to TIME: a closed-form solution modifies all cross-attention KV projection matrices.

 

MACE

MACE: Mass Concept Erasure in Diffusion Models

MACE

Its final method for fusing multiple LoRAs into one resembles that in Mix-of-Show.

 

SLD

Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models

SLD

 

ESD

Erasing Concepts from Diffusion Models

  1. fine-tune StableDiffusion

  2. Editing in reverse: erases the content associated with a given text from generated images.

  3. Uses classifier guidance in reverse: fine-tune the model so that its predicted noise approaches the pre-trained model's prediction under negated classifier guidance.

ESD

 

AC

Ablating Concepts in Text-to-Image Diffusion Models

  1. Makes StableDiffusion forget certain concepts: e.g., given a prompt containing "in the style of Van Gogh", the model ignores "Van Gogh" and generates in an ordinary style.

  2. Build prompts $c^*$ containing "in the style of Van Gogh"; removing it yields the corresponding plain prompts $c$. Use images generated from $c$ as training data and fine-tune StableDiffusion so that, for the same input $x_t$, the output conditioned on $c^*$ approaches the output conditioned on $c$, with gradients disabled on the $c$ branch.

 

Unlearning

Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

Unlearning

  1. Adversarial training: make the model's predicted noise for "grumpy cat" indistinguishable from that for "cat", so the modified model treats "grumpy cat" as "cat" and ignores "grumpy".

 

FMN

Forget-Me-Not: Learning to Forget in Text-to-Image Diffusion Models

  1. For concepts StableDiffusion should forget, collect some reference images and build prompts containing the concept; fine-tune the entire StableDiffusion with a loss equal to the sum of squared responses of the concept's cross-attention maps at all cross-attention layers.

  2. Note that no diffusion loss is needed during fine-tuning.

 

PCE

Pruning for Robust Concept Erasing in Diffusion Models

  1. Our method selectively prunes critical parameters associated with the concepts targeted for removal, thereby reducing the sensitivity of concept-related neurons. Our method can be easily integrated with existing concept-erasing techniques, offering a robust improvement against adversarial inputs.

  2. stage 1: We use a numerical criterion to identify concept neurons.

  3. stage 2: We validate concept neurons are sensitive to adversarial prompts.

 

ConceptPrune

ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning

  1. We first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning.

 

Prompt-Tuning-Erase

Removing Undesirable Concepts in Text-to-Image Generative Models with Learnable Prompts

  1. Learn a prompt embedding that can be directly concatenated to the CLIP text embedding and fed into cross-attention.

  2. EM-like alternating updates of the prompt embedding $p^k$ and StableDiffusion $\epsilon_{\theta^k}$; let $\epsilon_\theta$ be the original StableDiffusion and $c_e$ a prompt containing the concept to erase. To update $p^k$, minimize $\lVert \epsilon_{\theta^k}(c_e, p) - \epsilon_\theta(c_e) \rVert$, transferring the knowledge of $c_e$ into $p$: since $\theta^k$ has been optimized to remove that knowledge, $p$ must learn it to lower the loss; one step yields $p^{k+1}$. To update $\theta^k$, use two losses: $\lVert \epsilon_{\theta^k}(c_e) - \epsilon_\theta(\varnothing) \rVert$ removes the knowledge of $c_e$, and $\lVert \epsilon_{\theta^k}(c_e, p) - \epsilon_\theta(\varnothing) \rVert$ acts as a regularizer; one step yields $\theta^{k+1}$. Iterate until convergence.

 

SuppressEOT

Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

SuppressEOT-1

SuppressEOT-2

  1. Targets only the case where a prompt of the form "... without xxx" still generates images containing "xxx".

  2. Zeroing "xxx" alone or zeroing the EOT embeddings alone does not solve the problem; only zeroing both works. The EOT embeddings are also close to one another.

  3. Run SVD on the matrix formed by the "xxx" and EOT embeddings ($(N - |p| - 1) \times 768$); the main singular values correspond to the suppressed information (the negative target), so the singular values are suppressed, the matrix is reconstructed, and generation proceeds.
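Singular-value suppression can be sketched directly with numpy's SVD (the number of suppressed values and the matrix size are illustrative):

```python
import numpy as np

def suppress_top_singular_values(emb, k=1, scale=0.0):
    # emb: (num_tokens, dim) matrix of the negative-target and EOT embeddings;
    # shrink the k largest singular values, which carry the suppressed concept,
    # then reconstruct the matrix.
    u, s, vt = np.linalg.svd(emb, full_matrices=False)
    s[:k] *= scale
    return u @ np.diag(s) @ vt

rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 16))
emb_suppressed = suppress_top_singular_values(emb, k=2)
```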

 

SepCE4MU

Separable Multi-Concept Erasure from Diffusion Models

 

All but One

All but One: Surgical Concept Erasing with Model Preservation in Text-to-Image Diffusion Models

 

Geom-Erasing

Geom-Erasing: Geometry-Driven Removal of Implicit Concept in Diffusion Models

Use image-text pairs containing QR codes, watermarks, or text overlays; add their location information to the text and fine-tune StableDiffusion, so that generating with the original text alone avoids producing QR codes, watermarks, and text.

Geom-Erasing

 

Ring-A-Bell

Ring-A-Bell! How Reliable are Concept Removal Methods for Diffusion Models?

 

Diff-QuickFix

Localizing and Editing Knowledge in Text-to-Image Generative Models

Knowledge of different attributes (objects, style, color, action) is distributed across different UNet blocks; only the block corresponding to the attribute of the concept to be edited or ablated is fine-tuned.

 

EraseDiff

EraseDiff: Erasing Data Influence in Diffusion Models

During training, noise the data to be forgotten with non-Gaussian noise, so that sampling will not generate it.

 

TV

Robust Concept Erasure Using Task Vectors

 

Editioning

Training-free Editioning of Text-to-Image Models

  1. The opposite of erasing: makes the model specialize in generating a particular concept.

 

Image-to-Image Translation

SDEdit (no fine-tune)

SDEdit: Image Synthesis and Editing with Stochastic Differential Equations

Requires diffusion models trained on both the source and the target domain.

 

Inversion-by-Inversion (no fine-tune)

Inversion-by-Inversion: Exemplar-based Sketch-to-Photo Synthesis via Stochastic Differential Equations without Training

two-stage SDEdit

 

UNIT-DDPM (retrain)

UNIT-DDPM: UNpaired Image Translation with Denoising Diffusion Probabilistic Models

Geom-Erasing UNIT-inference
  1. unpaired

  2. A domain translation function extracts domain information.

 

LaDiffGAN (retrain)

LaDiffGAN: Training GANs with Diffusion Supervision in Latent Spaces

  1. Similar to Diff-Instruct: use a diffusion model to train a GAN for image-to-image translation.

 

CycleNet (retrain)

CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation

  1. unpaired

  2. Introduce the source-domain $x_0$ as a condition via ControlNet, $\epsilon_\theta(y_t, c_y, x_0)$; training relies on within-domain reconstruction and on the cycle consistency of translations between the source and target domains.

CycleNet

 

Palette (retrain)

Palette: Image-to-Image diffusion models

  1. Paired; self-supervised learning automatically produces paired data, e.g., for colorization, inpainting, etc.

  2. condition source image through concatenation

 

DDBM (retrain)

Denoising Diffusion Bridge Models

DDBM

  1. paired

  2. The diffusion process now runs from a point of one distribution to its paired point in another; the training formulas are modified accordingly, and the forward process conditioned on both endpoints has a closed form, just like $q(x_t \mid x_0)$.

  3. Similar to ShiftDDPMs.

 

DBIM (retrain)

Diffusion Bridge Implicit Models

  1. The DDIM counterpart of DDBM — DBIM is to DDBM what DDIM is to DDPM — enabling accelerated sampling with a pre-trained DDBM.

 

ILVR (no fine-tune)

ILVR: Conditioning Method for Denoising Diffusion Probabilistic Models

unpaired

 

DiffusionCLIP (fine-tune)

DiffusionCLIP: Text-Guided Diffusion Models for Robust Image Manipulation

  1. unpaired

  2. One word corresponds to one domain and to one fine-tuned model.

  3. Pre-trained unconditional DDPM + pre-trained CLIP.

  4. Run $S_{\mathrm{for}}$ steps of DDIM Inversion on the dataset images up to the return step $t_0$ and cache the latents (reusable); generate $S_{\mathrm{gen}}$ steps from them, compute the CLIP directional loss on the final $x_0$, and fine-tune the DDPM once — similar to a recurrent network.

  5. GPU-efficient: during the $S_{\mathrm{gen}}$ generation steps from the latents, compute the CLIP directional loss on each step's $\hat{x}_0$ and fine-tune the DDPM each time — the same batch of samples fine-tunes the network $S_{\mathrm{gen}}$ times.

 

Rectifier (fine-tune)

High-Fidelity Diffusion-based Image Editing

  1. unpaired

  2. DiffusionCLIP

  3. Train a network to predict the convolution layers' LoRA parameters, avoiding DiffusionCLIP's recursive optimization.

rectifier-1

rectifier-2

 

EGSDE (no fine-tune)

EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations

  1. Needs only a diffusion model trained on the target domain: given a source-domain image, run SDEdit, with sampling guided by two pre-trained energy functions.

  2. Change domain-specific features: train a domain classifier, strip its classification head to obtain an encoder, compute the cosine similarity between the features of the generated latent and of the source image's noisy latent, and take the gradient as guidance.

  3. 保留domain-independent特征: low-pass filter,计算生成的latent和原图的noisy latent的低通滤波之间的L2距离,求梯度作为guidance。

 

DDIB (no fine-tune)

Dual Diffusion Implicit Bridges for Image-to-Image Translation

  1. Requires diffusion models trained on both the source and the target domain.

  2. The Probability Flow ODEs form a Schrödinger bridge between the source and target domains.

  3. Cycle consistency: take a source-domain sample x0, noise it to xT with the Probability Flow ODE of the source-domain model, denoise it to x0 with the Probability Flow ODE of the target-domain model, then apply the same operations in the opposite direction. If the discretization error of the Probability Flow ODE were zero (DDIM is a low-error discretization of it), x0 would be recovered exactly.

  4. The first half of the cycle is the translation itself.

 

DECDM (no fine-tune)

DECDM: Document Enhancement using Cycle-Consistent Diffusion Models

  1. An application of DDIB to document enhancement.

 

CycleDiffusion (no fine-tune)

Unifying Diffusion Models' Latent Space, With Applications to Cyclediffusion and Guidance

  1. Requires diffusion models trained on both the source and the target domain.

  2. The translation procedure is the same as DDIB's, but the DPM-Encoder replaces the Probability Flow ODE.

  3. With a single text-to-image model and two different text conditions, the two conditionals can be viewed as DPMs trained on a source and a target domain, so the method supports both image-to-image translation and image editing.

First encode with the source-domain model

DPM-Encoder

Then decode with the target-domain model

DPM-Decoder

Note that the DPM-Encoder is designed for stochastic diffusion models.
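A toy sketch of the DPM-Encoder idea, with a linear `mu_fn` standing in for the model's learned posterior mean: the encoder records the per-step noises of a stochastic trajectory, and decoding with the same model replays the trajectory exactly (translation instead decodes with the target-domain model).

```python
import numpy as np

def dpm_encode(x0, mu_fn, alpha_bars, sigmas, rng):
    """DPM-Encoder sketch: draw x_t from q(x_t|x0) at every step and record
    the noises z_t that the model's mean mu_fn would need to reproduce the
    trajectory.  mu_fn(x, t) is a toy stand-in for the model's posterior
    mean; alpha_bars[t-1] is the cumulative alpha for step t."""
    T = len(sigmas)
    xs = [x0] + [np.sqrt(a) * x0 + np.sqrt(1 - a) * rng.standard_normal(x0.shape)
                 for a in alpha_bars]            # xs[t] = x_t, xs[0] = x0
    zs = [(xs[t - 1] - mu_fn(xs[t], t)) / sigmas[t - 1] for t in range(T, 0, -1)]
    return xs[T], zs

def dpm_decode(x_T, zs, mu_fn, sigmas):
    """Replay the recorded noises under a (possibly different) model."""
    x = x_T
    for i, t in enumerate(range(len(sigmas), 0, -1)):
        x = mu_fn(x, t) + sigmas[t - 1] * zs[i]
    return x
```

With the same `mu_fn` on both sides the round trip recovers x0 exactly; swapping in a different `mu_fn` for decoding is the translation step.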

 

DDPM Inversion (no fine-tune)

An Edit Friendly DDPM Noise Space

Same method as the DPM-Encoder (the authors claim it differs from the DPM-Encoder, but no difference is apparent; perhaps they mean an earlier version of the DPM-Encoder?).

 

LEDITS (no fine-tune)

LEDITS: Real Image Editing with DDPM Inversion and Semantic Guidance

DDPM Inversion + SEGA (a combination of multiple guidance signals).

 

LEDITS++ (no fine-tune)

LEDITS++: Limitless Image Editing using Text-to-Image Models

Uses DPM-Solver for inversion, estimates a mask using both cross-attention maps and the DiffEdit approach, and performs mask-based editing.

 

Pix2Pix-Zero (no fine-tune)

Zero-shot Image-to-Image Translation

Pix2Pix-Zero

  1. Requires a pretrained StableDiffusion and performs zero-shot image-to-image translation such as cat → dog; StableDiffusion under different text inputs can be viewed as diffusion models trained on different domains.

  2. Uses BLIP to caption the source (cat) image and CLIP to encode the caption into c; uses GPT-3 to generate many sentences about cats and about dogs, encodes every sentence with CLIP, and computes the difference of the two embedding means, Δc.

  3. Runs regularized DDIM Inversion on the image with c to obtain xT. At each inversion step the prediction of ϵθ is refined by gradient descent on two losses (the released code uses ϵθ(zt, t, c) without CFG): one penalizes correlations between different positions, the other is the KL divergence between each position and a standard Gaussian.

  4. Reconstruct the image from xT with c, caching the cross-attention maps Mt_ref at every timestep; then generate from xT with c + Δc. At each timestep, first run the UNet once to get the cross-attention map Mt_edit, take one optimization step on xt using ‖Mt_ref − Mt_edit‖², then run the UNet on the optimized xt with c + Δc to predict xt−1.

 

CDM (retrain)

  1. unpaired

  2. Trains two encoders alongside the diffusion model, one for content and one for style, exploiting an inductive bias: the content code is a spatial layout mask, down- or upsampled to the feature-map size when used, while the style code is a vector carrying high-level semantics. Every UNet layer uses AdaGN, where style applies a channel-wise affine transformation and content is multiplied spatially with the AdaGN output.

  3. At sampling time, first DDIM-invert to noise using the image's own codes, then generate with the target image's content or style.

 

DiffuseIT (no fine-tune)

Diffusion-based Image Translation using Disentangled Style and Content Representation

SDEdit + guidance + resample technique

DiffuseIT

 

Few-Shot Diffusion (fine-tune)

Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption

  1. unpaired

  2. Uses a diffusion model pretrained on source domain A plus a few samples from target domain B for model adaptation, yielding a target-domain diffusion model. The model is initialized from the pretrained source-domain model and fine-tuned (DiffusionCLIP-style) on translations x0^{A→B} of arbitrary source images x^A, with the objective: Directional Distribution Consistency loss (between x^A and x0^{A→B}) + Gram-matrix style loss (between x0^{A→B} and x^B) + diffusion loss. The diffusion loss is trained on the target-domain-B samples, and x0^{A→B} is the x̂0 computed from the diffusion model's output via Tweedie's formula.

  3. Directional Distribution Consistency loss: first use the datasets and CLIP to compute a cross-domain direction vector w = (1/m)·Σᵢ E(xᵢ^B) − (1/n)·Σᵢ E(xᵢ^A); the loss L_DDC pulls E(x0^{A→B}) toward E(x^A) + w.

  4. Translation itself is SDEdit-like and uses only the target-domain diffusion model.

 

Fine-grained Appearance Transfer (no fine-tune)

Fine-grained Appearance Transfer with Diffusion Models

  1. unpaired

  2. Uses DIFT for semantic matching and feature transfer.

Fine-grained Appearance Transfer

 

S2ST

S2ST: Image-to-Image Translation in the Seed Space of Latent Diffusion

  1. unpaired

  2. DDIM-invert the source image to zT, generate from zT, compute losses on the result, and optimize zT; then, starting from the optimized zT, optimize while generating, similar to Null-Text Inversion.

structure loss: MSE between the Sobel gradients of the generated image and of the source image

appearance loss: MSE between the generated z0 and the mean autoencoder encoding of a few target-domain images

S2ST

 

FCDiffusion (fine-tune)

FCDiffusion-1

FCDiffusion-2

  1. Self-supervised ControlNet training: the model learns to reconstruct the lossless image features z0 from the lossy control signal c = FFM(z0) and the paired text prompt y.

  2. For translation, first DDIM-invert the source image, then generate using the target prompt and a chosen frequency band of the source image.

 

Style Transfer

DiffStyler

DiffStyler: Controllable Dual Diffusion for Text-Driven Image Stylization

  1. The text provides the style.

  2. Noise for 150 steps, then denoise for 50 steps; at each step compute x̂0 from xt and ϵθ(xt, t) via Tweedie's formula and optimize x̂0 with various losses.
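The Tweedie estimate used here (and in several entries below) is a one-liner; `alpha_bar_t` follows the usual DDPM cumulative-alpha notation:

```python
import numpy as np

def tweedie_x0(x_t, eps_pred, alpha_bar_t):
    """One-step estimate of the clean sample from x_t and the predicted
    noise: x0_hat = (x_t - sqrt(1 - a_bar) * eps) / sqrt(a_bar)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
```

When `eps_pred` equals the true noise that formed `x_t`, the estimate recovers x0 exactly.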

 

ZeCon

Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer

  1. The text provides the style.

  2. Noise the source image to an intermediate step, then denoise from there; at each step compute x̂0 from xt via Tweedie's formula, evaluate a CLIP loss + a contrastive loss (for content preservation) on x̂0, and use the gradient as guidance.

ZeCon

 

StyleAdapter

StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation

  1. Reference images provide the style.

  2. StableDiffusion

  3. Encode all reference images with CLIP and feed them into a trainable StyEmb network to obtain a style feature; insert a trainable cross-attention layer into StableDiffusion in which image tokens attend to the style feature, and add its output to the output of the text cross-attention layer before passing it on (Two-Path Cross-Attention). Only StyEmb and the newly inserted cross-attention layer are trained.

  4. For data augmentation, we apply the random crop, resize, horizontal flipping, rotation, etc., to generate K = 3 style references for each input image during training.

 

ArtFusion

ArtFusion: Arbitrary Style Transfer using Dual Conditional Latent Diffusion Models

  1. Reference images provide the style.

  2. Trains a diffusion model conditioned on content and style by self-reconstruction, using each input's own content (extracted by the LDM VAE) and style (VGG features) as conditions; at sampling time, different content and style images are used.

 

SGDiff

SGDiff: A Style Guided Diffusion Model for Fashion Synthesis

  1. Reference images provide the style.

  2. Similar to ArtFusion, using patches of the input as the style.

 

StyleDiffusion

StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models

  1. Reference images provide the style.

  2. First removes the style of both the source image and the reference image with a pretrained Style Removal model, then fine-tunes the model with a CLIP directional loss as in DiffusionCLIP, one model per style; in CLIP image-embedding space, the difference between the top pair should resemble the difference between the bottom pair.

StyleDiffusion-2

OSASIS

One-Shot Structure-Aware Stylized Image Synthesis

OSASIS

  1. Given a style image I^B_style, a diffusion model ϵθ trained on domain A, and a Diff-AE ϵA, fine-tune to obtain a domain-B model ϵB.

  2. Using SDEdit with ϵθ, generate the domain-A counterpart I^A_style of I^B_style as well as an arbitrary domain-A sample I^A_in; noise I^A_style and I^A_in to t0 with ϵA; then copy ϵA into ϵB, run generation on both sides in parallel, fine-tune ϵB with a CLIP directional loss, and regularize with a reconstruction loss on the style image.

  3. SPN: a structure-preserving network built from 1×1 convolutions that preserves the spatial information and structural integrity of I^A_in; x^SPN_t = SPN(I^A_in), and x_t + λ·x^SPN_t is used as the input to ϵB.

 

CartoonDiff

Cartoondiff: Training-free Cartoon Image Generation with Diffusion Transformer Models

No training at all: cartoonization is achieved simply by normalizing the predicted ϵθ; the normalization suppresses the generation of fine texture details.

 

ControlStyle

ControlStyle: Text-Driven Stylized Image Generation Using Diffusion Priors

ControlNet + DiffusionCLIP

ControlStyle

 

 

PortraitDiffusion

Portrait Diffusion: Training-free Face Stylization with Chain-of-Painting

Style Self-Attention Control

PortraitDiffusion

Ditail

Diffusion Cocktail: Fused Generation from Diffusion Models

Typically each style has its own fine-tuned model; given any pair of such models, any-to-any style transfer becomes possible by taking one model's generated image as content and restyling it with the other model.

The approach resembles PnP, injecting features and self-attention maps. The difference: storing the source image's features and self-attention maps costs too much memory, so the paper stores only the latents from the source generation and, during style transfer, runs one extra inference pass with the current model to recover the features and self-attention maps, with results nearly identical to using the original model's.

 

DiffStyle

Training-free Content Injection using h-space in Diffusion Models

DiffStyle

 

ColorizeDiffusion

ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text

Self-supervised training: extract the image's sketch, noise the image, and feed the noised result together with the sketch to the UNet; replace the UNet's cross-attention with linear layers that take the image embedding from a pretrained CLIP; train all parameters jointly for reconstruction.

At sampling time, feed the source image's sketch and the reference image's CLIP image embedding into the network, preserving the source structure while shifting the style toward the reference image.

The reference image embedding can also be manipulated with text: since CLIP's text and image embeddings are aligned, the embedding can be shifted in CLIP space according to a given text and scale.

 

HiCAST

HiCAST: Highly Customized Arbitrary Style Transfer with Adapter Enhanced Diffusion Models

 

FreeStyle

FreeStyle: Free Lunch for Text-guided Style Transfer using Diffusion Models

FreeStyle

  1. training-free

  2. Inspired by FreeU: the feature obtained by passing the source (content) image through the UNet encoder + decoder serves as the backbone feature, rich in low-frequency (content) information, and is scaled by a coefficient; the feature obtained by passing xt through the UNet encoder serves as the skip feature, rich in high-frequency (style) information, and is scaled in the Fourier domain (FFT, multiply by a coefficient, then iFFT).
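A minimal sketch of the frequency-domain skip modulation; the radial cutoff separating the "high-frequency" band is an illustrative choice, not a value from the paper:

```python
import numpy as np

def modulate_skip(feat, s, cutoff=0.25):
    """FreeStyle-like skip modulation (sketch): scale only the
    high-frequency band of an (H, W) feature map in the Fourier domain.
    `cutoff` is the fraction of the Nyquist radius kept untouched."""
    H, W = feat.shape
    F = np.fft.fftshift(np.fft.fft2(feat))
    yy, xx = np.mgrid[:H, :W]
    r = np.hypot(yy - H / 2, xx - W / 2) / (min(H, W) / 2)
    mask = np.where(r > cutoff, s, 1.0)          # boost/suppress high freqs
    return np.real(np.fft.ifft2(np.fft.ifftshift(F * mask)))
```

With `s = 1` the map passes through unchanged; `s > 1` amplifies the high-frequency (style) content.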

 

ASI

Tuning-Free Adaptive Style Incorporation for Structure-Consistent Text-Driven Style Transfer

ASI

  1. training-free

  2. P2P simply appends the style prompt to the original (content) prompt for text-guided style transfer, but this destroys source information such as hair.

  3. Run cross-attention with the content prompt and the style prompt separately to obtain features Fc and Fs. Compute one mask from the distribution difference between Fc and Fs (1 where the difference is large, 0 elsewhere) and another by thresholding (0 where Fc is large, 1 elsewhere). OR the two masks: 1 marks regions to change, 0 regions to keep. Then fuse with an AdaIN-like rule: (σ(Fs)·(Fc − μ(Fc))/σ(Fc) + μ(Fs)) ⊙ M + Fc ⊙ (1 − M).
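The fusion rule above can be sketched as follows, using global rather than per-channel statistics for brevity:

```python
import numpy as np

def adain_masked(Fc, Fs, M, eps=1e-6):
    """ASI-style fusion (sketch): AdaIN-stylize the content feature with
    the style feature's statistics, but only inside the change mask M
    (M = 1 where the style should be applied)."""
    stylized = Fs.std() * (Fc - Fc.mean()) / (Fc.std() + eps) + Fs.mean()
    return stylized * M + Fc * (1.0 - M)
```

Where M is 0 the content feature passes through untouched; where M is 1 the output carries the style feature's mean and standard deviation.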

 

InST

Inversion-Based Creativity Transfer with Diffusion Models

  1. StableDiffusion

  2. Encode the reference image with CLIP and train a network that maps the image embedding to a token embedding (not a CLIP encoding), which is fed into the pretrained StableDiffusion (through the CLIP text encoder first); the network is trained with the TI method.

 

ArtBank

ArtBank: Artistic Style Transfer with Pre-trained Diffusion Model and Implicit Style Prompt Bank

ArtBank

  1. ISPB: each style has a learnable parameter matrix, converted by that style's dedicated SSAM into a token embedding; it is trained via TI on a few images of that style, optimizing only the ISPB.

  2. Stochastic Inversion:Random noise is hard to predict, and incorrectly predicted noise can cause a content mismatch between the stylized image and the content image. To this end, we first add random noise to the content image and use the denoising U-Net in the diffusion model to predict the noise in the image. The predicted noise is used as the initial input noise during inference to preserve content structure.

 

LSAST

Towards Highly Realistic Artistic Style Transfer via Stable Diffusion with Step-aware and Layer-aware Prompt

LSAST

  1. Similar to ProSpect: split the 1000 steps into 10 stages and the UNet into 3 parts; each (stage, part) pair has its own token embedding, trained via TI on a set of style images.

  2. At generation time, besides DDIM Inversion, a pretrained edge ControlNet preserves the structure of the content image.

 

StyleBooth

StyleBooth: Image Style Editing with Multimodal Instruction

StyleBooth

  1. TI-based style transfer.

  2. Built on InstructPix2Pix: constructs a dataset to train W while fine-tuning InstructPix2Pix.

 

Pair-Customization

Customizing Text-to-Image Models with a Single Image Pair

Pair-Customization

  1. TI-based style transfer.

 

RB-Modulation

RB-Modulation: Training-Free Personalization of Diffusion Models using Stochastic Optimal Control

 

Inverse Problem

DPS

Diffusion Posterior Sampling for General Noisy Inverse Problems

Diffusion Posterior Proximal Sampling for Image Restoration

Solving General Noisy Inverse Problem via Posterior Sampling: A Policy Gradient Viewpoint

Improving Diffusion-Based Image Restoration with Error Contraction and Error Correction

Consistency Models Improve Diffusion Inverse Solvers

Deep Data Consistency: a Fast and Robust Diffusion Model-based Solver for Inverse Problems

Learning Diffusion Priors from Observations by Expectation Maximization

  1. y = Hx + ϵ, guidance: ∇xt ‖y − H·x̂0‖²
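One guidance correction of this form can be sketched as below; dropping the noise network's Jacobian (so that ∂x̂0/∂xt ≈ 1/√ᾱ) is a common simplification, not the exact DPS gradient, and the step size is an illustrative choice:

```python
import numpy as np

def dps_guidance_step(x_t, eps_pred, y, H, alpha_bar_t, step=0.2):
    """One DPS-style correction (sketch) for y = H x + noise.
    Approximates d x0_hat / d x_t by 1 / sqrt(alpha_bar), i.e. the
    score network's Jacobian is dropped."""
    x0_hat = (x_t - np.sqrt(1 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # gradient of 0.5 * ||y - H x0_hat||^2 w.r.t. x_t under the approximation
    grad = -(H.T @ (y - H @ x0_hat)) / np.sqrt(alpha_bar_t)
    return x_t - step * grad
```

Each correction nudges x_t so that the Tweedie estimate x̂0 better explains the measurement y.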

 

MCG

Improving Diffusion Models for Inverse Problems using Manifold Constraints

  1. y = Hx + ϵ, guidance: ∇xt ‖W(y − H·x̂0)‖²

 

DEFT

DEFT: Efficient Finetuning of Conditional Diffusion Models by Learning the Generalised h-transform

  1. Similar to PDAE: trains a gradient estimator on paired inverse-problem data to guide sampling.

 

DreamGuider

DreamGuider: Improved Training free Diffusion-based Conditional Generation

 

STSL

Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion

  1. Prior methods compute x̂0 with the first-order Tweedie's formula; this paper uses the second-order version.

 

DMPlug

DMPlug: A Plug-in Method for Solving Inverse Problems with Diffusion Models

DMPlug

  1. The left side of the figure is a DPS-like method, constraining each DDIM step with a loss between the step's predicted x̂0 and the measurement; DMPlug instead treats the whole DDIM sampler as a function R and repeatedly generates x0 = R(xT), optimizing xT with the measurement loss.

 

CI2RM

Fast Samplers for Inverse Problems in Iterative Refinement Models

  1. Conditional Conjugate Integrators

 

SBD

Reducing the cost of Posterior Sampling in Linear Inverse Problems via task-dependent Score Learning

 

Steered-Diffusion

Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis

 

FDEM

Fast Diffusion EM: a diffusion model for blind inverse problems with application to deconvolution

  1. Real-world setting: y = Hx + ϵ with only y and the noise level of ϵ known; both x and H must be recovered. Since H is an unobserved variable, the EM algorithm applies.

  2. Variational inference might also work.

 

Adapt and Diffuse: Sample-adaptive Reconstruction via Latent Diffusion Models

  1. y = Hx + n. At training time, take a pretrained autoencoder, encode x into z with its encoder, and train a new network that encodes y into ẑ, with the objectives that ẑ stay close to z and that the decoder recover x from ẑ.

  2. At inference only y is available: obtain z via Langevin sampling and decode it to x. Concretely, use ∇zt ‖H(D(ẑ0(zt))) − y‖² as the drift for Langevin sampling, where D is the pretrained autoencoder's decoder.

 

Restoration

Non-Blind

DDRM

Denoising Diffusion Restoration Models

 

DDNM

Zero-Shot Image Restoration Using Denoising Diffusion Null-Space Model

 

DDPG

Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance

 

IR-SDE

Image Restoration with Mean-Reverting Stochastic Differential Equations

IR-SDE

  1. The SDE of the PriorShift variant from ShiftDDPMs.

 

DeqIR

Deep Equilibrium Diffusion Restoration with Parallel Sampling

  1. DEQ-based

 

Blind

BlindDPS

Parallel Diffusion Models of Operator and Image for Blind Inverse Problems

 

GDP

Generative Diffusion Prior for Unified Image Restoration and Enhancement

GDP

 

BIRD

Blind Image Restoration via Fast Diffusion Inversion

BIRD

  1. 类似DMPlug,we aim to find the initial noise sample that can generate the image when applied to DDIM.

  2. η is a vector of all the parameters defining the degradation operator H; L_IR = ‖y − H_η(x0)‖², and η is optimized jointly with xT.

 

FlowIE

FlowIE: Efficient Image Enhancement via Rectified Flow

FlowIE

  1. Directly models the path between the two distributions with rectified flow, making it applicable to many tasks such as inpainting, colorization, and super-resolution.

 

AutoDIR

AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion

  1. Trains an image-restoration network that can handle different degradations.

  2. Trains a network to identify which predefined degradation (e.g. blur) the input image suffers from and fills it into a template to form a prompt (e.g. "a photo needs {blur} artifact reduction").

  3. Trains an LDM on data covering the predefined degradations: the source image is concatenated onto zt, the prompt is the condition, and the target is the restored image.

  4. At inference, feed the image through the network of step 2 to obtain the prompt, then pass both into the LDM for restoration.

 

DiffBIR

DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior

DiffBIR

 

PromptIR

PromptIR: Prompting for All-in-One Blind Image Restoration

PromptIR

 

ZeroAIR

Exploiting Diffusion Priors for All-in-One Image Restoration

 

Diff-Plugin

Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks

  1. Similar to AutoDIR.

 

TIP

TIP: Text-Driven Image Processing with Semantic and Restoration Instructions

  1. ControlNet-based: the ControlNet takes the degradation instruction, StableDiffusion takes the prompt, and training is self-supervised.

 

Decorruptor

Efficient Diffusion-Driven Corruption Editor for Test-Time Adaptation

Decorruptor

  1. create pairs of (clean, corrupted) images and utilize them for fine-tuning to enable the recovery of corrupted images to their clean states.

 

PromptFix

PromptFix: You Prompt and We Fix the Photo

PromptFix

  1. We compile approximately two million raw data points across eight tasks: image inpainting, object creation, image dehazing, image colorization, super-resolution, low-light enhancement, snow removal, and watermark removal. For each low-level task, we utilized GPT-4 to generate diverse training instruction prompts Pinstruction. These prompts include task-specific and general instructions. The task-specific prompts, exceeding 250 entries, clearly define the task objectives. For example, "Improve the visibility of the image by reducing haze" for dehazing.

  2. For watermark removal, super-resolution, image dehazing, snow removal, low-light enhancement, and image colorization tasks, we also generate "auxiliary prompts" for each instance. These auxiliary prompts describe the quality issues for the input image and provide semantic captions.

 

SUPIR

Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild

Uses an MLLM to generate the prompt and a ControlNet to inject the LQ image, feeding both into SDXL to generate the HQ result.

 

Face/Human

PGDiff

PGDiff: Guiding Diffusion Models for Versatile Face Restoration via Partial Guidance

  1. Partial guidance, similar to GradPaint; organizes many tasks under a unified framework.

 

PFStorer

PFStorer: Personalized Face Restoration and Super-Resolution

PFStorer

  1. Restoration with a reference image: the LQ input is injected the same way as in StableSR, and the reference image is injected ControlNet-style.

 

CLR-Face

CLR-Face: Conditional Latent Refinement for Blind Face Restoration Using Score-Based Diffusion Models

 

DiffBody

DiffBody: Human Body Restoration by Imagining with Generative Diffusion Prior

DiffBody

  1. ControlNet

 

 

Super Resolution

Once the LR image is upsampled to the HR resolution, the problem reduces to LQ-to-HQ restoration.

 

SRDiff

SRDiff: Single Image Super-Resolution with Diffusion Probabilistic Models

SRDiff

  1. A diffusion model conditioned on the LR image models the residual between HR and upsample(LR).

 

SR3

Image Super-Resolution via Iterative Refinement

The low-resolution image is upsampled to the high resolution and concatenated onto xt for training, similar to GLIDE's inpainting model.

 

StableSR

Exploiting Diffusion Prior for Real-World Image Super-Resolution

StableSR

  1. The LR image is upsampled to the HR resolution, encoded by the VAE encoder, and fed into a trainable time-aware encoder to obtain multi-scale features; a small convolutional network (SFT) is then trained to predict, from these features, the scale and shift that affine-transform the corresponding StableDiffusion features. Only the encoder and the SFT are trained.

  2. Color correction: for each channel of the prediction, subtract its mean and divide by its standard deviation, then multiply by the LR image's standard deviation in that channel and add the LR mean.

  3. Trains a CFW module that uses the VAE-encoder features Fe to modify the VAE-decoder features Fd, trained with MSE; only the CFW is trained.
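The color correction of step 2 in code (the prediction's channel statistics are matched to the LR reference's):

```python
import numpy as np

def color_correct(pred, lr_ref, eps=1e-6):
    """StableSR-style color correction (sketch): per channel, whiten the
    prediction and re-apply the LR reference's mean/std.
    pred, lr_ref: (H, W, C) float arrays."""
    out = np.empty_like(pred)
    for c in range(pred.shape[-1]):
        p, r = pred[..., c], lr_ref[..., c]
        out[..., c] = (p - p.mean()) / (p.std() + eps) * r.std() + r.mean()
    return out
```

This removes the global color cast that diffusion sampling can introduce while leaving spatial detail untouched.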

 

ResShift

ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting

Super-resolution: the diffusion process starts from HR and ends at LR, progressively adding the LR-HR residual; the posterior is derived as in ShiftDDPMs to model the reverse process.

 

SinSR

SinSR: Diffusion-Based Image Super-Resolution in a Single Step

Extends ResShift with deterministic DDIM-style sampling, then distills it into a single step.

 

PatchScaler

PatchScaler: An Efficient Patch-independent Diffusion Model for Super-Resolution

PatchScaler

  1. Confidence-driven loss: L_GRM = E_{y_LR}[ ‖y_HR − x_HR‖₁ + λ(C·‖y_HR − x_HR‖²₂ − η·log C) ], where x_HR is the ground-truth HR feature and C is the predicted confidence.

  2. A DiT is trained on x_HR.

  3. After the GRM produces the coarse HR feature y_HR, it is patchified; each patch's difficulty is determined by the mean confidence score of its pixels, a timestep t is chosen accordingly (harder patches get larger t), the patch is noised to y_t, and the DiT denoises it, similar to SDEdit.

 

Treg

Regularization by Texts for Latent Diffusion Inverse Solvers

Text-guided super-resolution and deblurring.

 

PromptSR

Image Super-Resolution with Text Prompt Diffusion

The upsampled LR image is concatenated onto xt and a diffusion model with cross-attention is trained from scratch; a pretrained text encoder encodes the prompt, which is fed into cross-attention. The prompts are instructions such as deblur or resize.

Text-guided Explorable Image Super-resolution

 

CoSeR

CoSeR: Bridging Image and Language for Cognitive Super-Resolution

Similar to PromptSR: generates rough HR reference images and a prompt from the LR input, then trains a diffusion model conditioned on both for super-resolution.

 

CasSR

CasSR: Activating Image Power for Real-World Image Super-Resolution

CasSR

Generates rough HR reference images from the LR input and trains a diffusion model conditioned on them together with the LR image for super-resolution.

 

SeeSR

SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution

SeeSR

 

PASD

Pixel-Aware Stable Diffusion for Realistic Image Super-Resolution and Personalized Stylization

PASD

 

XPSR

XPSR: Cross-modal Priors for Diffusion-based Image Super-Resolution

XPSR

Similar to SUPIR, it uses an MLLM to generate the prompt.

 

SAM-DiffSR

SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution

SAM-assisted.

 

SkipDiff

SkipDiff: Adaptive Skip Diffusion Model for High-Fidelity Perceptual Image Super-resolution

SkipDiff

  1. action 0 is to perform the reverse diffusion process with the current state, while action 1 is to skip the diffusion process.

 

ECDP

Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution

ECDP

  1. Retrains a score model with two losses.

  2. L_score is the standard diffusion loss.

  3. During training, first sample a result from the LR input via the score model's PF ODE, then compute a perceptual loss against the HR image (L_quality); the ODE can be backpropagated without storing intermediate states (Neural Ordinary Differential Equations).

 

FDDif

Frequency-Domain Refinement with Multiscale Diffusion for Super Resolution

 

BlindDiff

BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution

  1. most methods are tailored to solving non-blind inverse problems with fixed known degradation settings, limiting their adapt ability to real-world applications that involve complex unknown degradations.

  2. Introduces an estimate of the degradation level.

 

CDFormer

CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution

  1. Blind Image Super-Resolution

 

Inpainting

Blended Diffusion

Blended Diffusion for Text-driven Editing of Natural Images

Blended Latent Diffusion

  1. training-free,text-free + text-guided

  2. pre-trained unoncditional diffusion model + pre-trained CLIP as guidance。

  3. Inpainting-style: at each sampling step, the unmasked region of the result is replaced by a sample from q(xt|x0).

  4. extending augmentations
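The blending rule of step 3 as a sketch (mask = 1 inside the edited region; outside the mask the sample is overwritten with a forward-process sample of the source image):

```python
import numpy as np

def blended_step(x_t_gen, x0_src, mask, alpha_bar_t, rng):
    """Blended Diffusion step (sketch): keep the model's sample inside the
    edit mask and replace the rest with a q(x_t|x0) sample of the source
    image, so the unmasked region always tracks the original."""
    x_t_src = (np.sqrt(alpha_bar_t) * x0_src
               + np.sqrt(1 - alpha_bar_t) * rng.standard_normal(x0_src.shape))
    return mask * x_t_gen + (1 - mask) * x_t_src
```

Applied after every denoising step, this keeps the background consistent with the source while the masked region is synthesized freely.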

 

LatentPaint

LatentPaint: Image Inpainting in Latent Space with Diffusion Models

  1. training-free,text-free

  2. Applies the blended trick to latent representations (e.g. the h-space).

 

RePaint

RePaint: Inpainting using Denoising Diffusion Probabilistic Models

resample

  1. training-free,text-free

  2. resample technique
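One "resample" jump in code: re-noise x_{t−1} back to x_t with the one-step forward kernel, so the model can denoise it again and harmonize the inpainted and known regions over several repeats:

```python
import numpy as np

def resample_jump(x_prev, beta_t, rng):
    """RePaint-style resample jump (sketch): diffuse x_{t-1} back to x_t
    with the forward kernel x_t = sqrt(1-beta) x_{t-1} + sqrt(beta) eps."""
    return (np.sqrt(1.0 - beta_t) * x_prev
            + np.sqrt(beta_t) * rng.standard_normal(x_prev.shape))
```

RePaint alternates this jump with ordinary reverse steps several times per timestep before moving on.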

 

CoPaint

Towards Coherent Image Inpainting Using Denoising Diffusion Implicit Models

  1. training-free,text-free

  2. At each step of stochastic DDIM, update the sample by gradient descent on the MSE between its x̂0 and the unmasked region of the source image.

  3. Also uses the resample technique.

 

GradPaint

GradPaint: Gradient-Guided Inpainting with Diffusion Models

  1. training-free,text-free

  2. The gradient-guidance version of CoPaint: compute the MSE between each step's result and the unmasked region of the source image and use its gradient as guidance, similar to posterior sampling.

 

Tiramisu

Image Inpainting via Tractable Steering of Diffusion Models

Tractable Probabilistic Models

 

GLIDE

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

  1. training-based,text-guided

  2. See the Text-Guided Inpainting Model.

 

Imagenator

Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting

  1. training-based,text-guided

  2. The Imagen version of GLIDE's text-guided inpainting model; directly downsampling and concatenating causes artifacts at the mask boundary, so an encoder is trained to perform the downsampling.

Imagenator

 

StableInpainting

High-Resolution Image Synthesis with Latent Diffusion Models

  1. training-based,text-guided

  2. The StableDiffusion version of GLIDE's text-guided inpainting model, trained on LAION with random masks; the masked image is also encoded by the VAE encoder, and the mask is downsampled to the size of zt.

 

SmartBrush

SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model

  1. training-based,text-guided

  2. In xt only the foreground is noised; the background remains the original image.

  3. self supervised learning using panoptic segmentation dataset

  4. mask augmentation + background preservation with mask prediction

  5. At editing time, the mask can additionally specify a shape.

SmartBrush

 

PowerPaint

A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

  1. training-based,text-guided

  2. Same training recipe as StableInpainting, with an extra trainable prompt inserted into the text to serve as the task prompt.

PowerPaint

 

ControlNet-Inpainting

Adding Conditional Control to Text-to-Image Diffusion Models

  1. training-based,text-guided

  2. zt + masked image + mask are fed to ControlNet as the condition.

 

BrushNet

BrushNet: A Plug-and-Play Image Inpainting Model with Decomposed Dual-Branch Diffusion

  1. training-based,text-guided

  2. An improved ControlNet: the added branch removes the cross-attention layers and processes only the image.

BrushNet

 

Brush2Prompt

Brush2Prompt: Contextual Prompt Generator for Object Inpainting

  1. Automatically generates an inpainting prompt from the content of the unmasked region and the shape of the mask, then runs a text-guided inpainting model.

 

LoMOE

LoMOE: Localized Multi-Object Editing via Multi-Diffusion

LoMOE

  1. training-free,text-guided。

  2. Uses BLIP to generate the image's prompt and regularized DDIM Inversion to obtain xT.

  3. Because multiple regions are edited, it uses mask-based MultiDiffusion: each region is denoised once with its own edit prompt, and the results are combined according to the masks.

  4. The classic two-branch setup: losses between the two branches update yt by gradient descent, with an MSE loss between cross-attention maps preserving the edited object's position and structure and a background-pixel MSE loss preserving the background.

 

HD-Painter

HD-Painter: High-Resolution and Prompt-Faithful Text-Guided Image Inpainting with Diffusion Models

  1. A training-free, text-guided method built on StableInpainting.

  2. Replaces every self-attention layer of the trained StableInpainting model with a Prompt-Aware Introverted Attention (PAIntA) layer. PAIntA computes self-attention as usual, but modifies each masked pixel's self-attention map: the response to every unmasked pixel is multiplied by a coefficient equal to the sum of that unmasked pixel's responses over all words in the cross-attention map, so that masked pixels attend more to text-relevant unmasked pixels. Since every self-attention (PAIntA) layer in StableInpainting precedes a cross-attention layer, the computation borrows the parameters of the following cross-attention layer.

  3. Reweighting Attention Score Guidance: compute each word's cross-attention map and a cross-entropy with the mask, so as to maximize the cross-attention scores in the masked region and minimize them in the unmasked region; sum over all words and use the gradient as guidance. Ordinary guidance pushes samples off-distribution and degrades quality, so here the guidance replaces the noise term of the stochastic DDIM update: that update keeps samples on-distribution when the injected noise is standard normal, so the guidance is divided by its standard deviation to match unit variance while its mean is kept, realizing the guidance without drift.

  4. A super-resolution LDM is trained to upscale the inpainting result.

 

MagicRemover

MagicRemover: Tuning-free Text-guided Image inpainting with Diffusion Models

MagicRemover

  1. Training-free, text-guided, specialized for object removal; the text names the object to remove.

  2. Optimizing zt toward zero response in the cross-attention map of the k-th word (e.g. "swan") naturally erases the corresponding object. Cross-attention responses, from high to low, correspond to the object, its shadow, and the background. Define g(t, k, λ) = ‖CAM_{t,k} − [min(CAM_{t,k}) + λ(max(CAM_{t,k}) − min(CAM_{t,k}))]‖₁ and use ∇zt g(t, k, λ) as guidance, together with the asymmetric h-space sampling method.

  3. The KV of the reconstructive trajectory's self-attention is injected into the inpainting trajectory; following MasaCtrl's idea, a mask of the object is estimated from the reconstructive trajectory's cross-attention, and the object region in the inpainting trajectory's self-attention attends only to the KV outside that mask.

 

Uni-paint

Uni-paint: A Unified Framework for Multimodal Image Inpainting with Pretrained Diffusion Model

UniPaint

  1. training-based,text-free + text-guided

  2. Blending alone is found insufficient: since the known information is inserted externally rather than generated by the model itself, the model lacks full context awareness, potentially causing incoherent semantic transitions near the hole boundary. A brief masked fine-tune of the model suffices, after which blended generation is used as before.

  3. Additionally adopts masked attention: in cross-attention, the text attends only to pixels inside the mask; in self-attention, only pixels inside the mask attend to one another.

 

MaGIC

Multi-modality Guided Image Completion

  1. Training-based, text-based, built on the StableInpainting model.

  2. Each modality gets an encoder that extracts multi-scale features, injected at the corresponding scales of the UNet encoder. Structure-form modalities (segmentation, edges, etc.) are added directly; context-form modalities (text, style, etc.) are pooled and injected into cross-attention as context vectors. StableDiffusion Inpainting is frozen and only the modality encoders are trained, one modality at a time, somewhat like ControlNet.

  3. At sampling time several modality encoders can be used together, but not via the injection above (features are not additive). Instead, an MSE loss is computed between the UNet's multi-scale features and those obtained when a single modality encoder is attached, and its gradient is used as guidance; since gradients are additive, multi-modal guidance is achieved without multi-modal retraining.

 

Inpaint Anything

Inpaint Anything: Segment Anything Meets Image Inpainting

SAM + any inpainting model

 

StrDiffusion

Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting

StrDiffusion

  1. training-based

  2. The IR-SDE formulation, with the masked image as μ.

  3. Sparse structure: e.g. the grayscale map and the edge map.

 

ByteEdit

ByteEdit: Boost, Comply and Accelerate Generative Image Editing

ByteEdit

  1. StableInpainting with feedback learning

  2. Suppose the diffusion model generates in 20 steps. During training, noise only up to step 15, then sample a timestep t uniformly from [1, 10], generate from step 15 down to t without tracking gradients, and finally generate from t down to 0 with gradients, optimizing that part of the chain.

 

SketchInpainting

Sketch-guided Image Inpainting with Partial Discrete Diffusion Process

SketchInpainting

  1. Applies discrete diffusion only to the tokens in the masked region; a dataset is constructed for self-supervised training.

 

LazyDiffusion

Lazy Diffusion Transformer for Interactive Image Editing

LazyDiffusion

  1. Patchify the masked image with Pixel-α's patch scheme and feed it to a transformer context encoder; only the tokens covered by the mask are kept as the global context.

  2. Initialize from Pixel-α; noise the source image and patchify it, feed only the mask-covered tokens to the model, concatenate the global context tokens onto the input, feed the prompt via cross-attention, and train all parameters end-to-end for denoising.

  3. A dataset is constructed for self-supervised training, similar to SmartBrush.

 

Outpainting

Ten

Generative Powers of Ten

zoom stack

 

PQDiff

Continuous-Multiple Image Outpainting in One-Step via Positional Query and A Diffusion-based Approach

PQDiff

  1. Randomly crop two views, an anchor view and a target view, resize them to the same shape, compute RPE from their top-left coordinates, and train to generate the target view conditioned on the anchor view.

 

PBG

Salient Object-Aware Background Generation using Text-Guided Diffusion Models

PBG

  1. We use Stable Inpainting as a base model and add the ControlNet model on top to adapt it to the salient object outpainting task.

 

Representation Learning

Diff-AE

Diffusion Autoencoders: Toward a Meaningful and Decodable Representation

 

SODA

SODA: Bottleneck Diffusion Models for Representation Learning

SODA

The UNet has 2m+1 layers and z is split into m parts {z_i}. In Adaptive GroupNorm, encoder and decoder layers at the same resolution share the same z_i; a random subset of {z_i} is zeroed out during training, which both enables classifier-free guidance at generation time and improves disentanglement among the z_i.

 

PDAE

Unsupervised Representation Learning from Pre-trained Diffusion Probabilistic Models

 

DBAE

Diffusion Bridge AutoEncoders for Unsupervised Representation Learning

DBAE

  1. An encoder maps x0 to z, z is decoded into xT, and DDBM models the bridge between x0 and xT.

  2. In Diff-AE and PDAE the data's information is split between z and xT, whereas in DBAE xT depends entirely on z, so all information is stored in z. With deterministic sampling there is no stochastic-variation effect, and inferring xT in DBAE is also faster.

 

HDAE

Hierarchical Diffusion Autoencoders and Disentangled Image Manipulation

 

DiffuseGAE

DiffuseGAE: Controllable and High-fidelity Image Manipulation from Disentangled Representation

Learns disentangled representations on Diff-AE's latent space.

 

DisDiff

DisDiff: Unsupervised Disentanglement of Diffusion Probabilistic Models

 

EncDiff

Diffusion Model with Cross Attention as an Inductive Bias for Disentanglement

 

CL-Dis

Closed-Loop Unsupervised Representation Disentanglement with beta-VAE Distillation and Diffusion Probabilistic Feedback

 

FDAE

Factorized Diffusion Autoencoder for Unsupervised Disentangled Representation Learning

 

DiTi

Exploring Diffusion Time-steps for Unsupervised Representation Learning

 

CausalDiffAE

Causal Diffusion Autoencoders: Toward Counterfactual Generation via Diffusion Probabilistic Models

 

Object-Centric Learning

Object-Centric Slot Diffusion

Learning to Compose: Improving Object Centric Learning by Injecting Compositionality

 

DDAE as Self-supervised Learners

Denoising Diffusion Autoencoders are Unified Self-supervised Learners

 

DiffMAE

Diffusion Models as Masked Autoencoders

 

MDM

Masked Diffusion as Self-supervised Representation Learner

MDM

An MAE with a dynamic mask ratio.

 

StableRep

StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Generates images from text to use as training data.

Can Generative Models Improve Self-Supervised Representation Learning?

Generates images from source images to use as data; instance-guided generation serves as an augmentation for SSL.

Unlike StableRep, we do not replace a real dataset with a synthetic one. Instead, we leverage conditional generative models to enrich augmentations for self-supervised learning. In addition, our method does not require text prompts and directly uses images as input to the generative model.

 

GenPoCCL

Multi Positive Contrastive Learning with Pose-Consistent Generated Images

GenPoCCL

 

GenView

GenView: Enhancing View Quality with Pretrained Generative Model for Self-Supervised Learning

GenView

 

SynCLR-SynCLIP

Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images

SynCLR-SynCLIP

 

DreamDA

DreamDA: Generative Data Augmentation with Diffusion Models

DreamDA

Gaussian noise is added to the h-space feature used to predict x̂0, while the original feature predicts the direction; DDIM sampling then produces variations of the source image.

 

l-DAE

Deconstructing Denoising Diffusion Models for Self-Supervised Learning

 

ADDP

ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process

 

InfoDiffusion

InfoDiffusion: Representation Learning Using Information Maximizing Diffusion Models

 

RepFusion

Diffusion Model as Representation Learner

Distill the intermediate representation from a pre-trained diffusion model to a recognition student.

After the distillation phase, the student is reapplied as a feature extractor and fine-tuned with the task label.

Reinforced Time Selection for Distillation.

 

De-Diffusion

De-Diffusion Makes Text a Strong Cross-Modal Interface

text as representation, encoder is a captioning model, decoder is a text2img model

gumbel softmax

 

DiffSSL

Do text-free diffusion models learn discriminative visual representations?

Uses the UNet's intermediate feature maps for discriminative tasks.

 

Other Tasks

hybrid

InstructDiffusion

InstructDiffusion: A Generalist Modeling Interface for Vision Tasks

Each task is converted into an instruction; (source image, instruction, target image) triplets serve as training data. The instruction is the text input, and a StableDiffusion model is trained to generate the target image, with the source image concatenated onto z_t.

Trained on InstructPix2Pix data, it can also perform editing.

 

InstructCV

InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists

InstructCV

The instruction is the text input to StableDiffusion; the encoded source image x is concatenated onto z_t for training.

 

Object Detection

DiffusionDet

DiffusionDet: Diffusion Model for Object Detection

 

CamoDiffusion

CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models

 

DiffRef3D

DiffRef3D: A Diffusion-based Proposal Refinement Framework for 3D Object Detection

 

DiffuBox

DiffuBox: Refining 3D Object Detection with Point Diffusion

 

MonoDiff

Monocular 3D Object Detection and Pose Estimation with Diffusion Models

 

SDDGR

SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection

 

Edge Detection

DiffusionEdge

DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection

 

Correspondence

DiffMatch

Diffusion Model for Dense Matching

 

Caption

CLIP-Diffusion-LM

Apply Diffusion Model on Image Captioning

 

DiffCap

DiffCap: Exploring Continuous Diffusion on Image Captioning

 

Text-only Image Captioning

Text-Only Image Captioning with Multi-Context Data Generation

 

Prefix-Diffusion

Prefix-diffusion: A Lightweight Diffusion Model for Diverse Image Captioning

 

LaDiC

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

LaDiC

 

Visual Grounding

PVD

Parallel Vertex Diffusion for Unified Visual Grounding

 

DiffusionVG

Language-Guided Diffusion Model for Visual Grounding

 

DiffusionVG

Exploring Iterative Refinement with Diffusion Models for Video Grounding

 

Visual Prediction

DDP

DDP: Diffusion Model for Dense Visual Prediction

 

Action Anticipation

DIFFANT

DIFFANT: Diffusion Models for Action Anticipation

 

Amodal Segmentation

pix2gestalt

pix2gestalt: Amodal Segmentation by Synthesizing Wholes

 

Segmentation

DFormer

DFormer: Diffusion-guided Transformer for Universal Image Segmentation

 

OVDiff

Diffusion Models for Zero-Shot Open-Vocabulary Segmentation

 

GCDP

Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis

  1. The image and the segmentation map are concatenated along the channel dimension into a single sample to train a text-guided diffusion model, using a new Gaussian-Categorical distribution formulation; given text, the model can jointly generate the image and its segmentation, and can also generate one from the other.

 

SemFlow

SemFlow: Binding Semantic Segmentation and Image Synthesis via Rectified Flow

  1. We train an ordinary differential equation (ODE) model to transport between the distributions of real images and semantic masks.

 

UniGS

UniGS: Unified Representation for Image Generation and Segmentation

 

DiffDASS

Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation

Domain-adaptive semantic segmentation: uses image translation to transfer a segmentation model across domains.

Requires source-domain images with segmentation maps, plus target-domain images.

Train a segmentation model on the source-domain images and maps, and a diffusion model on the target-domain images. Apply SDEdit to a source-domain image, using the segmentation model and the ground-truth map to compute a loss for gradient correction; this produces the target-domain counterpart of that segmentation map. Those pairs then fine-tune the source-domain segmentation model into a target-domain one.

 

DGInStyle

DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

 

LDMSeg

A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting

LDMSeg

Depth

D4RD

Digging into contrastive learning for robust depth estimation with diffusion models

 

Optical Flow

FlowDiffuser

FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models

 

Retrieval

DiffusionRet

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

 

Temporal Action Detection

DiffTAD

DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion

 

Object Tracking

DiffusionTrack

DiffusionTrack: Diffusion Model For Multi-Object Tracking

 

DiffusionTrack

DiffusionTrack: Point Set Diffusion Model for Visual Object Tracking

 

Video Moment Retrieval

MomentDiff

MomentDiff: Generative Video Moment Retrieval from Random to Real

 

Sound Event Detection

DiffSED

DiffSED: Sound Event Detection with Denoising Diffusion

 

Knowledge Distillation

DM-KD

Is Synthetic Data From Diffusion Models Ready for Knowledge Distillation?

Uses data generated by a diffusion model as the training set: feed it to the pre-trained teacher and distill into the student. This removes the restriction to real datasets and works well; low-fidelity generated images (e.g., produced with fewer sampling steps) work even better.

 

DiffKD

Knowledge Diffusion for Distillation

Trains a diffusion model on features extracted by the teacher. The student's features are treated as noisy versions of the teacher's features and denoised; a KL loss between the denoised features and the teacher's features optimizes the student.
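The denoise-then-match idea can be sketched as follows. This is an assumed simplification: a one-step Tweedie-style denoise with a hypothetical `eps_model`, and an MSE in place of the paper's KL loss.

```python
import numpy as np

def diffkd_loss(student_feat, teacher_feat, eps_model, alpha_bar_t):
    """DiffKD-style sketch: view the student feature as a noisy teacher
    feature at noise level alpha_bar_t, do a one-step denoise with a noise
    predictor trained on teacher features, then match the denoised feature
    against the teacher feature (MSE here instead of the paper's KL)."""
    eps_hat = eps_model(student_feat)
    # one-step x0 estimate: (x_t - sqrt(1 - a_bar) * eps_hat) / sqrt(a_bar)
    denoised = (student_feat - np.sqrt(1.0 - alpha_bar_t) * eps_hat) / np.sqrt(alpha_bar_t)
    return float(np.mean((denoised - teacher_feat) ** 2))
```

With a perfectly matched student and a zero-noise prediction, the loss is zero, as expected.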

 

Classification

RDC

Robust Classification via a Single Diffusion Model

$\min_y \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\left[\|\epsilon_\theta(x_t,t,y)-\epsilon\|_2^2\right]$
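The objective above (classify by the class whose conditional denoiser best predicts the noise) can be sketched with a Monte-Carlo estimate. The `eps_model` interface, `alphas_bar` schedule, and sample count below are illustrative assumptions, not RDC's implementation:

```python
import numpy as np

def diffusion_classify(x0, classes, eps_model, alphas_bar, n_samples=64, seed=0):
    """Pick argmin_y of (1/T) * sum_t E[||eps_theta(x_t, t, y) - eps||^2],
    estimated by sampling random timesteps and noises.
    eps_model(x_t, t, y): class-conditional noise predictor (stand-in)."""
    rng = np.random.default_rng(seed)
    T = len(alphas_bar)
    losses = {}
    for y in classes:
        total = 0.0
        for _ in range(n_samples):
            t = int(rng.integers(0, T))                      # uniform timestep
            eps = rng.standard_normal(x0.shape)
            x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1 - alphas_bar[t]) * eps
            total += np.mean((eps_model(x_t, t, y) - eps) ** 2)
        losses[y] = total / n_samples
    return min(losses, key=losses.get), losses
```

A class whose predictor tracks the true noise well gets a lower loss and wins the argmin.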

 

TiF

Few-shot Learner Parameterization by Diffusion Time-steps

LoRA fine-tunes StableDiffusion on the few-shot dataset with the prompt "a photo of [C]", then classifies with a formula similar to RDC above, but introduces a timestep-dependent weight into the formula and shows that this weight matters a lot.

 

CiP

Image Captions are Natural Prompts for Text-to-Image Models

For datasets with only class labels, such as ImageNet: use a pre-trained captioning model to caption each image, append the caption to "a photo of class" to form a prompt, then use pre-trained StableDiffusion to generate an image for that prompt and replace the original image with it. The synthetic dataset has the same size as the original, and classifiers trained on it perform better.

 

 

Data Attribution

Diffusion Attribution

For a generated image, which training data contribute to it much?

Evaluating Data Attribution for Text-to-Image Models

Intriguing Properties of Data Attribution on Diffusion Models

Detecting Image Attribution for Text-to-Image Diffusion Models in RGB and Beyond

 

Dataset Distillation

LD3M

Latent Dataset Distillation with Diffusion Models

Dataset distillation aims to generate a small set of representative synthetic samples from the original training set.

 

D4M

D4M: Dataset Distillation via Disentangled Diffusion Model

 

OOD

DiffGuard

DiffGuard: Semantic Mismatch-Guided Out-of-Distribution Detection using Pre-trained Diffusion Models

DIFFGUARD

 

NODI

NODI: Out-Of-Distribution Detection with Noise from Diffusion

 

DiffPath

Out-of-Distribution Detection with a Single Unconditional Diffusion Model

 

Image Quality Assessment

PFD-IQA

Feature Denoising Diffusion Model for Blind Image Quality Assessment

 

eDifFIQA

eDifFIQA: Towards Efficient Face Image Quality Assessment Based On Denoising Diffusion Probabilistic Models

 

DP-IQA

DP-IQA: Utilizing Diffusion Prior for Blind Image Quality Assessment in the Wild

 

NR-IQA

Comparison of No-Reference Image Quality Models via MAP Estimation in Diffusion Latents

 

Generative Understanding

Use features from pre-trained generative networks to assist recognition models, or use generated data to improve recognition models.

 

hybrid

DatasetDM

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models

DatasetDM

 

DMaaPx

Upgrading VAE Training With Unlimited Data Plans Provided by Diffusion Models

 

DMP

Exploiting Diffusion Prior for Generalizable Pixel-Level Semantic Prediction

 

Syn-Rep-Learn

Scaling Laws of Synthetic Images for Model Training

 

Vermouth

Bridging Generative and Discriminative Models for Unified Visual Perception with Diffusion Priors

Vermouth

To effectively transfer learned features to discriminative tasks while ensuring compatibility, an intuitive approach is to introduce the prior knowledge of the recognition model. A pre-trained ResNet-18 provides the discriminative prior F_exp; ResNet naturally produces multi-resolution features that can be concatenated with the UNet's features.

The U-head has two flows: a down-sample flow producing global features for tasks such as classification, and an up-sample flow producing spatial features for tasks such as segmentation.

 

GenPercept

Diffusion Models Trained with Large Data Are Transferable Visual Models

We show that, simply initializing image understanding models using a pre-trained UNet (or transformer) of diffusion models, it is possible to achieve remarkable transferable performance on fundamental vision perception tasks using a moderate amount of target data.

Takes a pre-trained diffusion model, feeds in the original image at timestep 1, and fine-tunes the model to predict the target, e.g., depth.

 

Classification

Diffusion Classification

Diffusion Models Beat GANs on Image Classification

UNet feature + classification head

 

FGDS

Feedback-Guided Data Synthesis for Imbalanced Classification

FGDS

 

Analyzing and Explaining Image Classifiers via Diffusion Guidance

 

Diversify, Don’t Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

 

Active Generation for Image Classification

 

Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model

 

Efficient Exploration of Image Classifier Failures with Bayesian Optimization and Text-to-Image Models

 

Image Retrieval

Zero-Shot Sketch-based Image Retrieval

Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers

 

Sketch

DiffSketch

Representative Feature Extraction During Diffusion Process for Sketch Extraction with One Example

 

Object Detection

DiffusionEngine

DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection

  1. Uses StableDiffusion to produce detection data, similar to Attention as Annotation.

  2. First, add a single noise step to existing detection images and feed them into StableDiffusion, training a Detection Adaptor that produces bounding boxes from the UNet feature-map pyramid. Then freeze the Detection Adaptor, build simple generic prompts, noise and regenerate the existing detection images (similar to SDEdit), feed the final step's feature-map pyramid into the Detection Adaptor, and use its output as bounding-box annotations for the generated images.

 

T2I-for-Detection

Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation

Uses StableDiffusion to produce detection data by generating foreground and background separately, then compositing them.

 

Data Augmentation for Object Detection via Controllable Diffusion Models

 

Learning Compositional Language-based Object Detection with Diffusion-based Synthetic Data

 

3DiffTection

3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features

 

DetDiffusion

DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception

 

Segmentation

DDPMSeg

Label-Efficient Semantic Segmentation With Diffusion Models

  1. Noise the data and feed it into a pre-trained DDPM's UNet; upsample the feature maps from each decoder layer to the image size and concatenate them, so each pixel gets one vector, which is fed into an MLP for label prediction during training.

  2. Empirically, the feature maps of decoder layers B = {5, 6, 7, 8, 12} at noise levels t = {50, 150, 250} are all concatenated, and multiple independent MLPs are trained, with majority voting at prediction time.
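The per-pixel descriptor construction above can be sketched as follows; nearest-neighbor upsampling and the toy shapes are assumptions for illustration (the paper uses bilinear upsampling on real UNet activations):

```python
import numpy as np

def pixel_descriptors(feature_maps, out_hw):
    """DDPMSeg-style descriptors: upsample each decoder feature map
    (C, h, w) to the image size (nearest neighbor here) and concatenate
    along channels, yielding one vector per pixel for an MLP classifier."""
    H, W = out_hw
    cols = []
    for f in feature_maps:
        C, h, w = f.shape
        ri = np.arange(H) * h // H          # row indices into the small map
        ci = np.arange(W) * w // W          # column indices
        cols.append(f[:, ri][:, :, ci])     # (C, H, W)
    return np.concatenate(cols, axis=0)     # (sum_i C_i, H, W)
```

Each spatial position of the result is the concatenated multi-layer feature vector the MLPs are trained on.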

 

EmerDiff

EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models

We leverage the semantic knowledge extracted from Stable Diffusion (SD) and aim to develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.

 

MaskDiffusion

MaskDiffusion: Exploiting Pre-trained Diffusion Models for Semantic Segmentation

MaskDiffusion-seg

 

OVAM

Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models

OVAM-1

  1. Estimates the segmentation from the cross-attention maps of the attribution prompt across multiple timesteps and layers.

OVAM-2

  1. The attribution prompt is not necessarily the best description; borrowing the idea of TI, a small amount of data can be used for token optimization, i.e., optimizing the attribution prompt's token embedding, which works much better.

 

FreeSeg-Diff

FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

FreeSeg-Diff

  1. training-free

 

DatasetDiffusion

Dataset Diffusion: Diffusion-based Synthetic Dataset Generation for Pixel-Level Semantic Segmentation

  1. In $A_s^{\tau} A_c$, $\tau$ is an exponent that sharpens the self-attention map, which is then used to enhance the cross-attention map in a convolution-like way: each pixel's self-attention map (H×W) acts as a kernel and the cross-attention map as the feature, so when a pixel's self-attention map and the cross-attention map respond strongly in the same region, that pixel receives a high value in $A_s^{\tau} A_c$.
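A minimal sketch of this enhancement, assuming $A_s^{\tau}$ is a repeated matrix product of the (row-stochastic) self-attention map, which is one common reading of the exponent:

```python
import numpy as np

def enhance_cross_attention(A_s, A_c, tau=4):
    """Dataset Diffusion-style refinement (sketch): raise the self-attention
    map to power tau, then propagate the cross-attention through it, so each
    pixel's (HW,) self-attention row acts as a kernel over A_c.
    A_s: (HW, HW) row-stochastic self-attention; A_c: (HW, K) cross-attention."""
    A = np.linalg.matrix_power(A_s, tau)
    return A @ A_c                      # (HW, K) enhanced cross-attention
```

Pixels that attend to the same region as a high cross-attention response get boosted; with an identity self-attention map the cross-attention passes through unchanged.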

 

DIFF

Diffusion Features to Bridge Domain Gap for Semantic Segmentation

DIFF

 

VPD

Unleashing Text-to-Image Diffusion Models for Visual Perception

cross-attention map

 

EVP

EVP: Enhanced Visual Perception using Inverse Multi-Attentive Feature Refinement and Regularized Image-Text Alignment

enhanced VPD

 

Meta-Prompt

Harnessing Diffusion Models for Visual Perception with Meta Prompts

 

ODISE

Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

ODISE

 

DiffSeg

Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion

DiffSeg utilizes a pre-trained StableDiffusion model and specifically its self-attention layers to produce high quality segmentation masks.

 

DiffSegmenter

Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter

Designs a prompt and feeds it into StableDiffusion together with the image; the cross-attention map of a given word yields a rough segmentation of that object, which is then refined and completed using the self-attention maps.

 

Attention as Annotation

Attention as Annotation: Generating Images and Pseudo-masks for Weakly Supervised Semantic Segmentation with Diffusion

Instead of relying on manual annotation, uses StableDiffusion to generate large amounts of image and segmentation-map (cross-attention map) data for training segmentation models.

 

SegGen

SegGen: Supercharging Segmentation Models with Text2Mask and Mask2Img Synthesis

  1. Take an existing segmentation dataset (image, segmentation map), encode the segmentation map as a three-channel image, caption the images with BLIP2, and fine-tune SDXL (caption → segmentation map) to obtain a Text2Mask model.

  2. Train a ControlNet on (image, segmentation map) pairs to obtain a Mask2Img model.

  3. These two networks can then generate new segmentation training data: take an image from the existing dataset, caption it with BLIP2, feed the caption into the Text2Mask model to get a set of segmentation maps, then feed those into the Mask2Img model to get the corresponding images, forming data pairs.

  4. For the same segmentation model, training on the existing dataset plus the generated data clearly outperforms training on the existing dataset alone.

 

FoBaDiffusion

Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models

 

Scribble-Supervised Semantic Segmentation

Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation

Trains a ControlNet on scribbles to generate segmentation training data.

 

Outline

Outline-Guided Object Inpainting with Diffusion Models

With a small amount of instance-segmentation data, uses StableInpainting to produce object variations of those examples, augmenting the dataset.

 

Ref LDM-Seg

Explore In-Context Segmentation via Latent Diffusion Models

 

ScribbleGen

ScribbleGen: Generative Data Augmentation Improves Scribble-supervised Semantic Segmentation

ScribbleGen

 

Grounding

Peekaboo

Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors

Given an image and a sentence describing some object in it, uses the pre-trained text2img model StableDiffusion to predict the target object's mask.

 

Grounded Diffusion

Guiding Text-to-Image Diffusion Model Towards Grounded Generation

Uses the pre-trained text2img model StableDiffusion to output an image from text while also outputting the image's segmentation mask.

First generate images with StableDiffusion, then produce their segmentation masks with a pre-trained object detector to build a dataset; this dataset then trains the grounding module, similar in spirit to Label-Efficient Semantic Segmentation With Diffusion Models.

grounding

GenPromp

Generative Prompt Model for Weakly Supervised Object Localization

 

 

Semantic Correspondence

SD complements DINO

A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence

exploit Stable Diffusion features for semantic and dense correspondence

 

DIFT

Emergent Correspondence from Image Diffusion

No training needed: Stable Diffusion features can be matched directly.

 

SD4Match

SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching

SD4Match

  1. prompt tuning

 

Diffusion-Hyperfeatures

Diffusion Hyperfeatures: Searching Through Time and Space for Semantic Correspondence

Diffusion-Hyperfeatures

  1. For a given feature map r, upsample it to a standard resolution, pass it through a bottleneck layer B to a standard channel count, and weight it with a mixing weight ω. The final descriptor map is $\sum_{s=0}^{S}\sum_{l=1}^{L}\omega_{s,l}B_l(r_{s,l})$, where S is the number of DDIM generation or inversion steps, L the number of UNet layers, $B_l$ a trainable network shared across timesteps, and $\omega_{s,l}$ a trainable weight.

  2. DDIM generation and inversion behave similarly, so the method applies to both synthetic and real images.

  3. For semantic correspondence, we flatten the descriptor maps for a pair of images and compute the cosine similarity between every possible pair of points. We then supervise with the labeled corresponding keypoints using a symmetric cross entropy loss in the same fashion as CLIP.
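The weighted aggregation $\sum_{s,l}\omega_{s,l}B_l(r_{s,l})$ can be sketched as below; the nearest-neighbor upsampling and a plain linear projection standing in for the bottleneck $B_l$ are illustrative assumptions:

```python
import numpy as np

def hyperfeatures(raw_feats, bottlenecks, weights, out_hw):
    """Diffusion Hyperfeatures aggregation (sketch):
    d = sum_{s,l} weights[s,l] * B_l(upsample(raw_feats[s][l])).
    raw_feats: list over steps s of lists over layers l of (C_l, h, w) maps;
    bottlenecks[l]: (D, C_l) projection to a shared channel count D."""
    H, W = out_hw
    acc = None
    for s, per_layer in enumerate(raw_feats):
        for l, r in enumerate(per_layer):
            C, h, w = r.shape
            ri = np.arange(H) * h // H
            ci = np.arange(W) * w // W
            up = r[:, ri][:, :, ci]                          # (C, H, W)
            proj = np.einsum('dc,chw->dhw', bottlenecks[l], up)
            term = weights[s, l] * proj
            acc = term if acc is None else acc + term
    return acc                                               # (D, H, W)
```

Cosine similarity between descriptor columns of two images then gives the correspondence scores described in item 3.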

 

Depth and Saliency

Diffusion Scene Representation

Beyond Surface Statistics: Scene Representations in a Latent Diffusion Model

  1. Uses pre-trained networks to annotate a depth-and-saliency dataset on StableDiffusion-generated images.

  2. Extract the intermediate output of some self-attention layer at some sampling step. Interpolate lower resolution predictions to the size of synthesized images. A linear classifier is trained on it to predict the pixel-level logits.

  3. StableDiffusion plus the linear classifier can then be used for prediction.

 

JointNet

JointNet: Extending Text-to-Image Diffusion for Dense Distribution Modeling

JointNet

  1. a novel neural network architecture for modeling the joint distribution of images and an additional dense modality (e.g., depth maps).

 

ECoDepth

ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

ECoDepth

 

Multi-Object Tracking

TrackDiffusion

TrackDiffusion: Multi-object Tracking Data Generation via Diffusion Models

Uses a layout-to-image model to generate video sequences from tracklets as MOT training data.

 

DiffMOT

DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction

 

Unifying Generative and Understanding

EGC

EGC: Image Generation and Classification via a Diffusion Energy-Based Model

Energy function; optimization requires second-order derivatives.

Similar to Denoising Likelihood Score Matching for Conditional Score-based Data Generation.

 

DiffDis

DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

Unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process.

 

Factorized Diffusion

Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation

A unified generation-and-understanding model similar to MAGE.

A copy of the UNet decoder serves as a Mask Generator: at each step it produces K masks, each representing one segmentation region. Each mask multiplies the encoder's skip-connection features; the original UNet decoder then outputs K predicted noises from the masked skip-connection features, each multiplied by its own mask, and the sum of the K masked predicted noises is the final predicted noise used in the diffusion loss. At generation time, the image and segmentation map are thus produced together.

It can also segment real images: just add one noise step and denoise one step.

 

 

Other Interesting Paper

UnseenDiffusion

Unseen Image Synthesis with Diffusion Models

Uses a diffusion model pre-trained on one domain to generate out-of-domain samples.

DDIM-invert 2k OOD samples to step 500, obtaining 2k x₅₀₀ latents; compute their mean and variance, sample from this Gaussian distribution, and generate, yielding OOD samples.
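The Gaussian fit over inverted latents can be sketched as follows, assuming a simple per-dimension (diagonal) Gaussian; the actual inversion and reverse sampling are omitted:

```python
import numpy as np

def ood_latent_sampler(inverted_latents, n, seed=0):
    """UnseenDiffusion sketch: fit a per-dimension Gaussian to DDIM-inverted
    x_500 latents of OOD examples (shape: (num_examples, *latent_shape)),
    then draw n new x_500 latents; running the reverse diffusion process
    from these would yield new OOD-like images."""
    mu = inverted_latents.mean(axis=0)
    sigma = inverted_latents.std(axis=0)
    rng = np.random.default_rng(seed)
    return mu + sigma * rng.standard_normal((n,) + mu.shape)
```

If all inverted latents coincide, the fitted variance is zero and every sample equals the mean, which is a handy sanity check.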

 

IMPUS

IMPUS: Image Morphing with Perceptually-Uniform Sampling Using Diffusion Models

Goal: interpolate between two given images.

  1. Run TI on SD for each image to obtain the two images' text embeddings.

  2. LoRA fine-tune SD with the two text embeddings above.

  3. LoRA fine-tune SD with the embedding ϕ.

  4. Interpolate the text embeddings and generate with CFG.

 

DiffMorpher

DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing

DiffMorpher

 

AID

AID: Attention Interpolation of Text-to-Image Diffusion

Interpolates the cross-attention K/V of the two endpoint generation processes and substitutes them for the K/V in the cross-attention of the current interpolation point's generation process.
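A minimal sketch of this substitution inside a single cross-attention call; the linear K/V interpolation and toy shapes are assumptions (AID also explores other fusion variants):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aid_cross_attention(Q, K1, V1, K2, V2, alpha):
    """AID-style sketch: replace the current interpolation point's
    cross-attention K/V with a linear blend of the two endpoints' K/V,
    then attend as usual. Q: (n, d); K*, V*: (m, d)."""
    K = (1 - alpha) * K1 + alpha * K2
    V = (1 - alpha) * V1 + alpha * V2
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V
```

At alpha = 0 this reduces exactly to attention over the first endpoint's K/V.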

 

NoiseDiffusion

NoiseDiffusion: Correcting Noise for Image Interpolation with Diffusion Models beyond Spherical Linear Interpolation

For diffusion-model-generated images, DDIM Inversion plus slerp interpolation works well, but not for real images; correcting the noise in certain ways resolves this.
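For reference, the standard spherical linear interpolation (slerp) between two noise latents that NoiseDiffusion builds on can be sketched as:

```python
import numpy as np

def slerp(z1, z2, alpha, eps=1e-8):
    """Spherical linear interpolation between two noise latents: the usual
    way to interpolate Gaussian noise for diffusion sampling, since it
    roughly preserves the norm (unlike plain lerp)."""
    z1f, z2f = z1.ravel(), z2.ravel()
    cos = np.dot(z1f, z2f) / (np.linalg.norm(z1f) * np.linalg.norm(z2f) + eps)
    theta = np.arccos(np.clip(cos, -1.0, 1.0))
    if theta < eps:                       # nearly parallel: fall back to lerp
        return (1 - alpha) * z1 + alpha * z2
    return (np.sin((1 - alpha) * theta) * z1 + np.sin(alpha * theta) * z2) / np.sin(theta)
```

For orthogonal unit latents the midpoint stays on the unit sphere, which is the property that makes slerp preferable for Gaussian noise.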

 

BlackScholesDiffusion

Prompt Mixing in Diffusion Models using the Black Scholes Algorithm

  1. Generation by prompt interpolation.

 

Concept-centric Personalization

Concept-centric Personalization with Large-scale Diffusion Priors

A new task: personalize StableDiffusion into a model that specializes in generating images of a given concept. Unlike TI, this task targets a more abstract concept than the one in the reference images (e.g., human faces) and emphasizes fidelity and diversity in the generative results, so it requires at least thousands of images of the concept.

The approach separates the concept from other control conditions: fine-tune StableDiffusion on the provided concept dataset (always with null text) to obtain a concept-centric diffusion model, and generate with CFG; other control conditions, such as text and ControlNet, can also be injected via CFG.

Concept-centric-Personalization

Neural Network Diffusion

Neural Network Diffusion

parameter autoencoder + latent diffusion model

Diffusion-based Neural Network Weights Generation

 

FineDiffusion

FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes

FineDiffusion

  1. Fine-tune large pre-trained diffusion models scaling to large-scale fine-grained image generation with 10,000 categories.

  2. A new CFG variant: during training and sampling, the superclass label embedding replaces the null embedding.
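The guidance rule implied by item 2 can be written in one line; `eps_cond` and `eps_super` are assumed names for the fine-grained-class and superclass noise predictions:

```python
import numpy as np

def finediffusion_cfg(eps_cond, eps_super, w):
    """FineDiffusion-style CFG (sketch): guide the fine-grained class
    prediction away from the superclass prediction instead of the null
    (unconditional) one: eps = eps_super + w * (eps_cond - eps_super)."""
    return eps_super + w * (eps_cond - eps_super)
```

At w = 1 this reduces to the plain conditional prediction, mirroring standard CFG with the null embedding swapped for the superclass embedding.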

 

FactorizedDiffusion

Factorized Diffusion: Perceptual Illusions by Noise Decomposition

FactorizedDiffusion

  1. Generates illusions.

  2. Via different decomposition methods (e.g., high/low frequency, color, motion), the image x is decomposed into components $f_i(x)$, i.e., $x=\sum_i f_i(x)$. At sampling time, $x_t$ is predicted with several different prompts; the components of each prediction are extracted, and the component desired from each prompt is summed to obtain the final $\tilde{\epsilon}$.
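The noise recombination $\tilde{\epsilon}=\sum_i f_i(\epsilon_\theta(x_t,\text{prompt}_i))$ can be sketched as below, with a toy mean/residual decomposition standing in for the paper's frequency or color operators:

```python
import numpy as np

def combine_noise(eps_preds, decompose):
    """Factorized Diffusion sketch: given one noise prediction per prompt
    and a decomposition operator with x = sum_i components(x)[i], keep
    component i of prompt i's prediction and sum them."""
    return sum(decompose(e)[i] for i, e in enumerate(eps_preds))

def mean_residual(x):
    """Toy 2-way decomposition: low-frequency (mean) + residual; the two
    components sum back to x."""
    m = np.full_like(x, x.mean())
    return [m, x - m]
```

With two prompts, prompt 0 controls the coarse (mean) component and prompt 1 the residual detail, which is the mechanism behind the hybrid-image illusions.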